Abstract: $f$-divergences are a general class of divergences between probability measures which include as special cases many commonly used divergences in probability, mathematical statistics and information theory, such as the Kullback-Leibler divergence, chi-squared divergence, squared Hellinger distance and total variation distance. In this paper, we study the problem of maximizing or minimizing an $f$-divergence between two probability measures subject to a finite number of constraints on other $f$-divergences. We show that these infinite-dimensional optimization problems can all be reduced to tractable optimization problems over small finite-dimensional spaces. Our results lead to a comprehensive and unified treatment of the problem of obtaining sharp inequalities between $f$-divergences. We demonstrate that many of the existing results on inequalities between $f$-divergences can be obtained as special cases of our results, and we also improve on some existing non-sharp inequalities.

Index Terms: $f$-divergences, Ali-Silvey divergences, sharp inequalities, extreme points, Choquet's theorem, Pinsker's inequality, minimax lower bounds.



I. Introduction 

Suppose that the Kullback-Leibler divergence between
two probability measures is bounded from above by 2. 
What then is the maximum possible value of the Hellinger 
distance between them? Such questions naturally arise in many 
fields including mathematical statistics and machine learning, 
information theory, probability, statistical physics etc. and the 
goal of this paper is to provide a way of answering them. From 
the optimization standpoint, this problem can be viewed as that 
of maximizing the Hellinger distance subject to a constraint 
on the Kullback-Leibler divergence over the space of all pairs 
of probability measures over all possible sample spaces. We 
shall prove in this paper that the value of this maximization 
problem remains unchanged if one restricts the sample space 
to be the three-element set {1, 2, 3}. In other words, in order 
to find the maximum Hellinger distance subject to an upper 
bound on the Kullback-Leibler divergence, one can just restrict 
attention to pairs of probability measures on {1,2,3}. Thus, 
the large infinite-dimensional optimization problem is reduced 
to an optimization problem over a small finite-dimensional 
space (of dimension $\le 4$), which makes it tractable.

In this paper, we prove such results in a very general setting. The Kullback-Leibler divergence and the (square of the) Hellinger distance are special instances of a general class of divergences between probability measures called $f$-divergences (also known as $\phi$-divergences). Let $f : (0, \infty) \to \mathbb{R}$ be a convex function satisfying $f(1) = 0$. By virtue of convexity, both the limits $f(0) := \lim_{x \downarrow 0} f(x)$ and $f'(\infty) := \lim_{x \uparrow \infty} f(x)/x$ exist, although they may equal $+\infty$. For two probability measures $P$ and $Q$, the $f$-divergence (see, for example, [1]-[4]), $D_f(P\|Q)$, is defined by

$$D_f(P\|Q) := \int_{\{q > 0\}} f\!\left(\frac{p}{q}\right) dQ + f'(\infty)\, P\{q = 0\}$$

where $p$ and $q$ are densities of $P$ and $Q$ with respect to a common measure $\lambda$. The definition does not depend on the choice of the dominating measure $\lambda$. Special cases of $f$ lead to, among others, the Kullback-Leibler divergence, total variation distance, square of the Hellinger distance and chi-squared divergence.
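On a finite sample space the definition above reduces to a finite sum, which can be sketched in a few lines of Python. The function name `f_divergence` and the convention of passing the two boundary limits $f(0)$ and $f'(\infty)$ explicitly are assumptions made for this illustration; they are not notation from the paper.

```python
import math

def f_divergence(p, q, f, f_at_zero, f_prime_inf):
    """D_f(P||Q) for finite distributions given as lists of probabilities.

    f_at_zero   : the limit f(0) = lim_{x -> 0+} f(x) (may be math.inf)
    f_prime_inf : the limit f'(inf) = lim_{x -> inf} f(x)/x (may be math.inf)
    """
    total = 0.0
    for pj, qj in zip(p, q):
        if qj > 0:
            ratio = pj / qj
            # Use the limit f(0) when the likelihood ratio vanishes.
            total += qj * (f_at_zero if ratio == 0 else f(ratio))
        elif pj > 0:
            # Mass of P where Q vanishes contributes f'(inf) * P{q = 0}.
            total += f_prime_inf * pj
    return total

# Generators f (for x > 0) together with their limits f(0) and f'(inf).
kl = (lambda x: x * math.log(x), 0.0, math.inf)        # Kullback-Leibler
tv = (lambda x: abs(x - 1) / 2, 0.5, 0.5)              # total variation
hellinger_sq = (lambda x: (math.sqrt(x) - 1) ** 2, 1.0, 1.0)
chi_sq = (lambda x: (x - 1) ** 2, 1.0, math.inf)       # chi-squared

P = [0.5, 0.5]
Q = [0.9, 0.1]
tv_value = f_divergence(P, Q, *tv)
h2_value = f_divergence(P, Q, *hellinger_sq)
```

Passing the limits separately avoids evaluating expressions such as `0 * log(0)` and makes the treatment of mutually singular pairs explicit.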

We are now ready to introduce the general form of the optimization problem we described at the beginning of the paper. Given divergences $D_f$ and $D_{f_i}$, $i = 1, \dots, m$, and nonnegative real numbers $D_1, \dots, D_m$, let

$$A(D_1, \dots, D_m) := \sup\{D_f(P\|Q) : D_{f_i}(P\|Q) \le D_i \ \forall i\}$$

and

$$B(D_1, \dots, D_m) := \inf\{D_f(P\|Q) : D_{f_i}(P\|Q) \ge D_i \ \forall i\}$$

where the probability measures on the right-hand sides above range over all possible measurable spaces. The goal of this paper is to provide a method for computing these quantities. We show that these large infinite-dimensional optimization problems can all be reduced to optimization problems over small finite-dimensional spaces. Specifically, in Theorem 2.1 we show that in order to compute these quantities, one can restrict attention to probability measures on the set $\{1, \dots, m + 2\}$.

One of the main reasons for studying the quantities $A(D_1, \dots, D_m)$ and $B(D_1, \dots, D_m)$ is that they yield sharp inequalities for the divergence $D_f$ in terms of the divergences $D_{f_1}, \dots, D_{f_m}$. Indeed, the inequalities

$$D_f(P\|Q) \le A(D_{f_1}(P\|Q), \dots, D_{f_m}(P\|Q)) \tag{1}$$

and

$$D_f(P\|Q) \ge B(D_{f_1}(P\|Q), \dots, D_{f_m}(P\|Q)) \tag{2}$$

hold for every pair of probability measures $P$ and $Q$. Further, the functions $A$ and $B$ satisfy the natural monotonicity inequalities

$$A(D_1, \dots, D_m) \le A(D_1', \dots, D_m') \tag{3}$$

and

$$B(D_1, \dots, D_m) \le B(D_1', \dots, D_m') \tag{4}$$

for every $(D_1, \dots, D_m)$ and $(D_1', \dots, D_m')$ such that $D_i \le D_i'$ for all $i$.

The inequalities (1) and (2) are sharp in the sense that $A$ is the smallest function satisfying (3) for which (1) holds for all probability measures $P$ and $Q$. Likewise, $B$ is the largest function satisfying (4) for which (2) holds for all probability measures $P$ and $Q$.

Inequalities between $f$-divergences are useful in many areas. For example, in mathematical statistics, they are crucial in problems of obtaining minimax bounds [5]-[8]. In probability, such inequalities are often used for converting limit theorems proved under a convenient divergence into limit theorems for other divergences [9]-[11]. They are also helpful for proving results in measure concentration [12]-[14]. Some applications in machine learning are described in [15]. Further, inequalities involving $f$-divergences are fundamental to the field of information theory [16], [17].

Because of their widespread use, many papers deal with inequalities between $f$-divergences (some references being [2], [8], [18]-[25], p3|). However, many of the inequalities presented in previous treatments are not sharp. The few papers which provide sharp inequalities [21], [23]-[25] only deal with certain special $f$-divergences as opposed to working in full generality. A popular such special case is $m = 1$ and $D_{f_1}$ corresponding to the total variation distance. In this case, sharp inequalities have been derived in [23] for the case when $D_f$ is the Kullback-Leibler divergence and in [24] for the case of general $D_f$. The only paper which deals with sharp inequalities for $m > 1$ is [25], but there the authors only study the case when $D_{f_1}, \dots, D_{f_m}$ are all primitive divergences (see Remark 3.1 below for the definition of primitive divergences).

In contrast with all previous papers in the area, we study the problem of obtaining sharp inequalities between $f$-divergences in full generality. In particular, our main results allow $m$ to be an arbitrary positive integer and all the divergences $D_f$ and $D_{f_1}, \dots, D_{f_m}$ to be arbitrary $f$-divergences. We show that the underlying optimization problems can all be reduced to low-dimensional optimization problems and we outline methods for solving them. We also show that many of the existing results on inequalities between $f$-divergences can be obtained as special cases of our results, and we improve on some existing non-sharp inequalities.

The rest of this paper is structured as follows. Our main result is stated in Theorem 2.1. Its three-part proof is given in Section III. The first part is based on a recent representation theorem for $f$-divergences which implies that the optimization problems for computing $A(D_1, \dots, D_m)$ and $B(D_1, \dots, D_m)$ can be thought of as maximizing or minimizing an integral functional over a certain class of concave functions satisfying a finite number of integral constraints. In the second part of the proof, we use Choquet's theorem to restrict attention only to the extreme points of the constraint set. Finally, in the third part, we characterize these extreme points and show that they correspond to probability measures over small finite sets.

In Section IV we collect some remarks and extensions of our main theorem and, in particular, we show that the theorem is tight in general. In Section V we consider various special cases and show that many well-known results in the literature can be obtained as simple instances of our main theorem. In Section VI we describe numerical methods for solving the low-dimensional optimization problems that come out of our main theorem. We solve an important subclass of these problems by convex optimization and we also describe heuristic methods for the general case.



II. Main Result 

For each $n \ge 1$, let $\mathcal{P}_n$ denote the space of all probability measures defined on the finite set $\{1, \dots, n\}$. Let us define $A_n(D_1, \dots, D_m)$ to be

$$\sup\{D_f(P\|Q) : P, Q \in \mathcal{P}_n \text{ and } D_{f_i}(P\|Q) \le D_i \ \forall i\}$$

and, analogously, $B_n(D_1, \dots, D_m)$ to be

$$\inf\{D_f(P\|Q) : P, Q \in \mathcal{P}_n \text{ and } D_{f_i}(P\|Q) \ge D_i \ \forall i\}.$$

Our main theorem is given below. The second part of the theorem requires that $D_{f_1}, \dots, D_{f_m}$ are finite divergences. We say that a divergence $D_f$ is finite if $\sup_{P,Q} D_f(P\|Q) < \infty$. The supremum here is taken over all probability measures over all possible measurable spaces. See Remark 3.2 for a detailed explanation of finite divergences.

Theorem 2.1: For every $D_1, \dots, D_m \ge 0$, we have

$$A(D_1, \dots, D_m) = A_{m+2}(D_1, \dots, D_m). \tag{5}$$

Further, if $D_{f_1}, \dots, D_{f_m}$ are all finite, then

$$B(D_1, \dots, D_m) = B_{m+2}(D_1, \dots, D_m). \tag{6}$$

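The conclusions of the above theorem may be better appreciated in the following optimization form. Theorem 2.1 states that the quantity $A(D_1, \dots, D_m)$ equals the optimal value of the following finite-dimensional optimization problem:

$$\begin{aligned}
\underset{p,\, q \in [0,1]^{m+2}}{\text{maximize}} \quad & \sum_{j : q_j > 0} q_j f\!\left(\frac{p_j}{q_j}\right) + f'(\infty) \sum_{j : q_j = 0} p_j \\
\text{subject to} \quad & p_j \ge 0,\ q_j \ge 0 \ \text{ for all } j = 1, \dots, m + 2, \\
& \sum_{j} p_j = \sum_{j} q_j = 1, \\
& \sum_{j : q_j > 0} q_j f_i\!\left(\frac{p_j}{q_j}\right) + f_i'(\infty) \sum_{j : q_j = 0} p_j \le D_i \ \text{ for } i = 1, \dots, m.
\end{aligned} \tag{7}$$

Similarly, when $D_{f_1}, \dots, D_{f_m}$ are all finite, $B(D_1, \dots, D_m)$ equals the optimal value of

$$\begin{aligned}
\underset{p,\, q \in [0,1]^{m+2}}{\text{minimize}} \quad & \sum_{j : q_j > 0} q_j f\!\left(\frac{p_j}{q_j}\right) + f'(\infty) \sum_{j : q_j = 0} p_j \\
\text{subject to} \quad & p_j \ge 0,\ q_j \ge 0 \ \text{ for all } j = 1, \dots, m + 2, \\
& \sum_{j} p_j = \sum_{j} q_j = 1, \\
& \sum_{j : q_j > 0} q_j f_i\!\left(\frac{p_j}{q_j}\right) + f_i'(\infty) \sum_{j : q_j = 0} p_j \ge D_i \ \text{ for } i = 1, \dots, m.
\end{aligned} \tag{8}$$

The proof of Theorem 2.1 is provided in the next section. In Section IV we argue that Theorem 2.1 is tight in general and also comment on the assumption of finiteness of $D_{f_1}, \dots, D_{f_m}$ for the validity of identity (6).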
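To make the finite-dimensional form concrete, the following Python sketch approximates problem (7) for $m = 1$, with the squared Hellinger distance as objective and the Kullback-Leibler divergence as constraint, by exhaustive search over a coarse grid on $\mathcal{P}_3$. This is only an illustration under stated assumptions: the grid search, the helper names `simplex_grid` and `approx_A`, and the grid resolution are all choices made here, and the result is merely a lower bound on $A_3(D)$. It is not the numerical method of Section VI.

```python
import math
from itertools import product

def simplex_grid(n_points, steps):
    """All probability vectors on {1..n_points} with entries k/steps."""
    for ks in product(range(steps + 1), repeat=n_points - 1):
        if sum(ks) <= steps:
            yield [k / steps for k in ks] + [(steps - sum(ks)) / steps]

def kl(p, q):
    """Kullback-Leibler divergence, equal to +inf when P is not dominated by Q."""
    total = 0.0
    for pj, qj in zip(p, q):
        if pj > 0:
            if qj == 0:
                return math.inf
            total += pj * math.log(pj / qj)
    return total

def hellinger_sq(p, q):
    """Squared Hellinger distance, i.e. D_f with f(x) = (sqrt(x) - 1)^2."""
    return sum((math.sqrt(pj) - math.sqrt(qj)) ** 2 for pj, qj in zip(p, q))

def approx_A(D, n_points=3, steps=12):
    """Grid lower bound on sup{H^2(P,Q) : KL(P||Q) <= D} over P_{n_points}."""
    best = 0.0
    grid = list(simplex_grid(n_points, steps))
    for p in grid:
        for q in grid:
            if kl(p, q) <= D:
                best = max(best, hellinger_sq(p, q))
    return best
```

Because the feasible set grows with $D$, the grid values inherit the monotonicity (3); refining `steps` tightens the lower bound at quadratic cost in the number of grid points.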
III. Proof of the Main Result
A. Testing Representation 

For two probability measures $P$ and $Q$, let us define the function $\psi_{P,Q} : [0, \infty) \to [0, 1]$ by

$$\psi_{P,Q}(s) := \int \min(p, qs)\, d\lambda \quad \text{for } s \in [0, \infty)$$

where $p$ and $q$ denote the densities of $P$ and $Q$ with respect to a common measure $\lambda$ (which can, for example, be taken to be $P + Q$). This function $\psi_{P,Q}$ is nonnegative, concave, non-decreasing and satisfies the inequality $0 \le \psi_{P,Q}(s) \le \min(1, s)$ for all $s \ge 0$. In other words, $\psi_{P,Q} \in \mathcal{C}$ where $\mathcal{C}$ denotes the class of all functions $\psi$ on $[0, \infty)$ that are nonnegative, concave, non-decreasing and satisfy the inequality $\psi(s) \le \min(1, s)$ for all $s \ge 0$. Moreover, it is true (see, for example, [25, Corollary 5]) that every function $\psi \in \mathcal{C}$ equals $\psi_{P,Q}$ for some pair of probability measures $P$ and $Q$.

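For each divergence $D_f$, let us associate the measure $\nu_f$ on $(0, \infty)$ defined by

$$\nu_f(a, b] := \partial^r f(b) - \partial^r f(a) \quad \text{for } 0 < a < b < \infty$$

where $\partial^r$ denotes the right derivative operator (note that by convexity $\partial^r f(x)$ exists for every $x \in (0, \infty)$). We also associate the functional $I_f : \mathcal{C} \to [0, \infty]$ by

$$I_f(\psi) := \int_0^\infty \left(\min(1, s) - \psi(s)\right) d\nu_f(s). \tag{9}$$

There is a precise connection between $D_f$ and $I_f$ that is given below: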

Lemma 3.1: For every pair of probability measures $P$ and $Q$, we have

$$D_f(P\|Q) = I_f(\psi_{P,Q}). \tag{10}$$

Lemma 3.1 is not new, although the form in which it is stated above is non-standard. The more standard version simply involves writing the integral in (9) over the interval $(0, 1)$ by the change of variable $t = s/(1 + s)$. In this modified form, Lemma 3.1 has been proved in [26] in the case when $f$ is twice differentiable and in [27] in the general case. A short proof is available in [28, Theorem 2.3].
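The identity (10) can be checked numerically in a simple case. The sketch below assumes $f(x) = (\sqrt{x} - 1)^2$ (squared Hellinger distance), for which $f$ is twice differentiable and $\nu_f$ has density $f''(s) = s^{-3/2}/2$, and assumes strictly positive finite distributions so that the integrand in (9) vanishes outside the range of the likelihood ratios. The function names and the midpoint rule are choices made for this illustration.

```python
import math

def psi(p, q, s):
    """Testing function psi_{P,Q}(s) = sum_j min(p_j, q_j * s)."""
    return sum(min(pj, qj * s) for pj, qj in zip(p, q))

def hellinger_sq(p, q):
    """Direct evaluation of D_f for f(x) = (sqrt(x) - 1)^2."""
    return sum((math.sqrt(pj) - math.sqrt(qj)) ** 2 for pj, qj in zip(p, q))

def integral_representation(p, q, steps=20000):
    """Midpoint-rule value of the integral of (min(1,s) - psi(s)) f''(s) ds.

    Assumes all entries of p and q are strictly positive, so the integrand
    is zero outside [min_j p_j/q_j, max_j p_j/q_j].
    """
    ratios = [pj / qj for pj, qj in zip(p, q)]
    lo, hi = min(ratios), max(ratios)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        s = lo + (k + 0.5) * h
        density = 0.5 * s ** -1.5          # f''(s) for f(x) = (sqrt(x) - 1)^2
        total += (min(1.0, s) - psi(p, q, s)) * density * h
    return total

P = [0.5, 0.5]
Q = [0.9, 0.1]
direct = hellinger_sq(P, Q)
via_lemma = integral_representation(P, Q)
```

The two values agree up to the quadrature error, which is what Lemma 3.1 predicts for this choice of $f$.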

Remark 3.1 (Primitive $f$-divergences): For each $s > 0$, let $u_s(t) := \min(1, s) - \min(t, s)$ for $t \in (0, \infty)$. Clearly, $u_s$ is a convex function on $(0, \infty)$ such that $u_s(1) = 0$. Moreover, it is a very simple convex function in the sense that it is piecewise linear with just two linear parts. It is straightforward to check that the divergence corresponding to $u_s$ is given by:

$$D_{u_s}(P\|Q) = \min(1, s) - \psi_{P,Q}(s).$$

Lemma 3.1 therefore asserts that any arbitrary $f$-divergence can be written as an integral of the primitive divergences $D_{u_s}$ with respect to the measure $\nu_f$ on $(0, \infty)$. The most well-known of these primitive divergences is the total variation distance, which corresponds to $s = 1$. Indeed,

$$D_{u_1}(P\|Q) = 1 - \int \min(p, q)\, d\lambda = \frac{1}{2} \int |p - q|\, d\lambda =: V(P, Q).$$

Every primitive divergence $D_{u_s}(P\|Q)$ is closely related to the smallest weighted average error (Bayes risk) in the problem of statistical testing between the hypotheses $P$ and $Q$ based on an observation $X$ (see, for example, [25, Lemma 3]).
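The formula for the primitive divergences is easy to verify on a finite space. The following Python sketch (function names are assumptions of this illustration) compares $\min(1, s) - \psi_{P,Q}(s)$ with a direct evaluation of $D_{u_s}$ from the definition of an $f$-divergence, using $u_s(0) = \min(1, s)$ and $u_s'(\infty) = 0$, and checks that $s = 1$ gives the total variation distance.

```python
def psi(p, q, s):
    """Testing function psi_{P,Q}(s) = sum_j min(p_j, q_j * s)."""
    return sum(min(pj, qj * s) for pj, qj in zip(p, q))

def primitive_divergence(p, q, s):
    """D_{u_s}(P||Q) computed through the testing function."""
    return min(1.0, s) - psi(p, q, s)

def primitive_divergence_direct(p, q, s):
    """D_{u_s}(P||Q) from the f-divergence definition with
    u_s(t) = min(1, s) - min(t, s); the f'(inf) term vanishes."""
    total = 0.0
    for pj, qj in zip(p, q):
        if qj > 0:
            total += qj * (min(1.0, s) - min(pj / qj, s))
    return total

def total_variation(p, q):
    """V(P, Q) = (1/2) * sum_j |p_j - q_j|."""
    return 0.5 * sum(abs(pj - qj) for pj, qj in zip(p, q))

P = [0.5, 0.5, 0.0]
Q = [0.7, 0.1, 0.2]
```

For instance, `primitive_divergence(P, Q, 1.0)` and `total_variation(P, Q)` both evaluate to 0.4 up to rounding.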

Remark 3.2 (Finiteness of a divergence): Lemma 3.1 implies that

$$\sup_{P,Q} D_f(P\|Q) = \int_0^\infty \min(1, s)\, d\nu_f(s) = f(0) + f'(\infty). \tag{11}$$

The supremum above is taken over all probability measures $P$ and $Q$ defined on all possible measurable spaces. To see (11), just note that, by Lemma 3.1, we have

$$\sup_{P,Q} D_f(P\|Q) = \sup_{P,Q} I_f(\psi_{P,Q}) = \sup_{\psi \in \mathcal{C}} I_f(\psi) = I_f(0).$$

Intuitively, $\psi_{P,Q}(s) = 0$ for all $s$ implies that $P$ and $Q$ are maximally separated (mutually singular), and thus the maximum value of $I_f(\psi)$ is achieved when $\psi$ is the identically zero function. The definition of $I_f$ gives that

$$I_f(0) = \int_0^\infty \min(1, s)\, d\nu_f(s).$$

Moreover, for the probability measures $P^* = (1, 0)$ and $Q^* = (0, 1)$ in $\mathcal{P}_2$, the function $\psi_{P^*,Q^*}$ equals 0. Therefore,

$$I_f(0) = D_f(P^*\|Q^*) = f(0) + f'(\infty),$$

which proves (11).

Recall that an $f$-divergence is finite if $\sup_{P,Q} D_f(P\|Q) < \infty$. By (11), an $f$-divergence is finite if and only if

$$\int_0^\infty \min(1, s)\, d\nu_f(s) = f(0) + f'(\infty) < \infty. \tag{12}$$

Well-known examples of finite divergences are the primitive divergences, the square of the Hellinger distance and the capacitory discrimination (which corresponds to the convex function (49)).
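Criterion (12) only involves the two boundary limits of $f$, so it can be tabulated directly. The small Python sketch below records these limits for a few standard generators; the dictionary `LIMITS` and the helper names are assumptions of this illustration, and the limits were computed by hand from the formulas in the comments.

```python
import math

# (f(0), f'(inf)) for some common generators f with f(1) = 0.
LIMITS = {
    "kullback_leibler": (0.0, math.inf),   # f(x) = x log x
    "chi_squared": (1.0, math.inf),        # f(x) = (x - 1)^2
    "hellinger_sq": (1.0, 1.0),            # f(x) = (sqrt(x) - 1)^2
    "total_variation": (0.5, 0.5),         # f(x) = |x - 1| / 2
}

def primitive_limits(s):
    """(u_s(0), u_s'(inf)) for u_s(t) = min(1, s) - min(t, s)."""
    return (min(1.0, s), 0.0)

def max_divergence(f_at_zero, f_prime_inf):
    """sup_{P,Q} D_f(P||Q) = f(0) + f'(inf), as in (11)."""
    return f_at_zero + f_prime_inf

def is_finite_divergence(f_at_zero, f_prime_inf):
    """Finiteness criterion (12)."""
    return math.isfinite(max_divergence(f_at_zero, f_prime_inf))

summary = {name: is_finite_divergence(*lims) for name, lims in LIMITS.items()}
```

Under these conventions the squared Hellinger distance is bounded by 2 and the total variation distance by 1, while the Kullback-Leibler and chi-squared divergences are not finite.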

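For each $f$ and $D \ge 0$, let us define

$$\mathcal{C}_1(f, D) := \{\psi \in \mathcal{C} : I_f(\psi) \le D\}$$

and

$$\mathcal{C}_2(f, D) := \{\psi \in \mathcal{C} : I_f(\psi) \ge D\}.$$

As a consequence of Lemma 3.1 we obtain that

$$A(D_1, \dots, D_m) = \sup\left\{I_f(\psi) : \psi \in \cap_{i=1}^m \mathcal{C}_1(f_i, D_i)\right\} \tag{13}$$

and

$$B(D_1, \dots, D_m) = \inf\left\{I_f(\psi) : \psi \in \cap_{i=1}^m \mathcal{C}_2(f_i, D_i)\right\}. \tag{14}$$

The following lemma on the derivatives of the function $\psi_{P,Q}$ (the left and right derivative operators are denoted by $\partial^l$ and $\partial^r$ respectively) will be useful in the sequel.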
Lemma 3.2: For every function $\psi = \psi_{P,Q}$ in $\mathcal{C}$, we have

$$\partial^l \psi(s) = Q\{p \ge sq\} \quad \text{for } s > 0 \tag{15}$$

and

$$\partial^r \psi(s) = Q\{p > sq\} \quad \text{for } s \ge 0. \tag{16}$$

Proof: For every $s > 0$,

$$\partial^l \psi(s) = \lim_{\epsilon \downarrow 0} \frac{\psi(s) - \psi(s - \epsilon)}{\epsilon}$$

and

$$\frac{\psi(s) - \psi(s - \epsilon)}{\epsilon} = \int \frac{\min(p, qs) - \min(p, q(s - \epsilon))}{\epsilon}\, d\lambda.$$

It is easy to check that the integrand above is bounded in absolute value by $q$ and converges as $\epsilon \downarrow 0$ to $q\, 1\{p \ge qs\}$. The identity (15) therefore follows by the dominated convergence theorem. The proof of (16) is similar. ■
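On a finite space, $\psi_{P,Q}$ is piecewise linear with kinks at the likelihood ratios $p_j/q_j$, so the two one-sided derivatives in Lemma 3.2 differ exactly at those points. The Python sketch below compares the formulas (15) and (16) with one-sided finite differences; the function names and the step size `h` are assumptions of this illustration.

```python
def psi(p, q, s):
    """Testing function psi_{P,Q}(s) = sum_j min(p_j, q_j * s)."""
    return sum(min(pj, qj * s) for pj, qj in zip(p, q))

def left_derivative(p, q, s):
    """Formula (15): Q{p >= s q}."""
    return sum(qj for pj, qj in zip(p, q) if pj >= s * qj)

def right_derivative(p, q, s):
    """Formula (16): Q{p > s q}."""
    return sum(qj for pj, qj in zip(p, q) if pj > s * qj)

def one_sided_differences(p, q, s, h=1e-6):
    """Backward and forward difference quotients of psi at s."""
    left = (psi(p, q, s) - psi(p, q, s - h)) / h
    right = (psi(p, q, s + h) - psi(p, q, s)) / h
    return left, right

P = [0.5, 0.5]
Q = [0.25, 0.75]
# s = 2 is a kink because p_1 = 2 * q_1; s = 1 is not a kink.
kink_left, kink_right = one_sided_differences(P, Q, 2.0)
```

At the kink $s = 2$ the left derivative equals $q_1 = 0.25$ while the right derivative is 0, in agreement with the non-strict and strict inequalities in (15) and (16).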

B. Reduction to Extreme Points 

Let us first recall the definition of extreme points. Let $S$ be a subset of a vector space $V$. A point $a \in S$ is called an extreme point of $S$ if $a = (b + c)/2$ for $b, c \in S$ implies that $a = b = c$. In other words, $a$ cannot be the mid-point of a non-trivial line segment whose end points lie in $S$. We denote the set of all extreme points of $S$ by $\mathrm{ext}(S)$.

An important result about extreme points in infinite-dimensional topological vector spaces is Choquet's theorem (see, for example, [29, Chapter 3]). We shall use the following version of Choquet's theorem in this section:

Theorem 3.3 (Choquet): Let $K$ be a metrizable, compact convex subset of a locally convex space $V$ and let $x_0$ be an element of $K$. Then there exists a Borel probability measure $\mu_0$ on $K$ which is concentrated on the extreme points of $K$ and which satisfies $L(x_0) = \int_K L(x)\, d\mu_0(x)$ for every continuous linear functional $L$ on $V$.

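The goal of this section is to prove the following:

Lemma 3.4: For every $D_1, \dots, D_m \ge 0$, we have

$$A(D_1, \dots, D_m) = \sup\left\{I_f(\psi) : \psi \in \mathrm{ext}\left(\cap_{i=1}^m \mathcal{C}_1(f_i, D_i)\right)\right\}$$

and further, if $D_{f_1}, \dots, D_{f_m}$ are all finite, we have

$$B(D_1, \dots, D_m) = \inf\left\{I_f(\psi) : \psi \in \mathrm{ext}\left(\cap_{i=1}^m \mathcal{C}_2(f_i, D_i)\right)\right\}.$$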



Proof: The proof is based on Theorem 3.3. Let $C[0, \infty)$ denote the space of all continuous functions on $[0, \infty)$ equipped with the topology given by the metric:

$$\rho(f, g) := \sum_{k \ge 1} 2^{-k}\, \frac{\sup_{0 \le x \le k} |f(x) - g(x)|}{1 + \sup_{0 \le x \le k} |f(x) - g(x)|}. \tag{17}$$

It is a fact (see, for example, [30, Chapter 1]) that $C[0, \infty)$ is a locally convex vector space under this topology. We shall apply Choquet's theorem to $V = C[0, \infty)$ and $K = \cap_{i=1}^m \mathcal{C}_1(f_i, D_i)$ for the first identity and $K = \cap_{i=1}^m \mathcal{C}_2(f_i, D_i)$ for the second identity. It is obvious that $\mathcal{C}$ is a subset of $C[0, \infty)$.

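Clearly both the sets $\cap_i \mathcal{C}_1(f_i, D_i)$ and $\cap_i \mathcal{C}_2(f_i, D_i)$ are convex. Also, by Fatou's lemma, $\cap_i \mathcal{C}_1(f_i, D_i)$ is closed under pointwise convergence, i.e., if $\psi_n \in \cap_i \mathcal{C}_1(f_i, D_i)$ and $\psi_n \to \psi$ pointwise, then $\psi \in \cap_i \mathcal{C}_1(f_i, D_i)$. To see this, observe that by Fatou's lemma, for each $i = 1, \dots, m$,

$$\begin{aligned}
I_{f_i}(\psi) &= \int_0^\infty \left(\min(1, s) - \psi(s)\right) d\nu_{f_i}(s) \\
&= \int_0^\infty \liminf_{n \to \infty} \left(\min(1, s) - \psi_n(s)\right) d\nu_{f_i}(s) \\
&\le \liminf_{n \to \infty} \int_0^\infty \left(\min(1, s) - \psi_n(s)\right) d\nu_{f_i}(s) \le D_i.
\end{aligned}$$

On the other hand, if each $D_{f_i}$ is a finite divergence, then by the dominated convergence theorem, $\cap_i \mathcal{C}_2(f_i, D_i)$ is also closed under pointwise convergence. Indeed, if $\psi_n \to \psi$ pointwise and $D_{f_i}$ is a finite divergence, then by dominated convergence (since $0 \le \min(1, s) - \psi_n(s) \le \min(1, s)$), we have $I_{f_i}(\psi_n) \to I_{f_i}(\psi)$.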



In Lemma 3.5 below, we show that $\mathcal{C}$ is a compact subset of $C[0, \infty)$ under the topology given by the metric $\rho$. Moreover, it is easy to see that convergence in the metric $\rho$ implies pointwise convergence. It follows hence that $\cap_i \mathcal{C}_1(f_i, D_i)$ is a compact, convex subset of $C[0, \infty)$ and, if each $D_{f_i}$ is a finite divergence, then $\cap_i \mathcal{C}_2(f_i, D_i)$ is also a compact, convex subset of $C[0, \infty)$.

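For each $\epsilon > 0$, let us define the functional $\Lambda_\epsilon$ on $C[0, \infty)$ by

$$\Lambda_\epsilon(\psi) := \int \left(\min(1, s) - \psi(s)\right) 1\{\epsilon \le s \le 1/\epsilon\}\, d\nu_f(s).$$

When restricted to the interval $[\epsilon, 1/\epsilon]$, the measure $\nu_f$ is a finite measure. Hence, $\Lambda_\epsilon$ is a continuous affine functional on $C[0, \infty)$ (a constant minus a continuous linear functional), and because the representing measure in Theorem 3.3 is a probability measure, the identity there extends from linear to affine functionals. Thus, by Theorem 3.3 we get that for every $\psi_0 \in \cap_i \mathcal{C}_1(f_i, D_i)$, there exists a Borel probability measure $\tau_0$ that is concentrated on the set of extreme points, $\mathrm{ext}(\cap_i \mathcal{C}_1(f_i, D_i))$, of $\cap_i \mathcal{C}_1(f_i, D_i)$ such that

$$\Lambda_\epsilon(\psi_0) = \int \Lambda_\epsilon(\psi)\, d\tau_0(\psi)$$

for every $\epsilon > 0$. Now, by the monotone convergence theorem,

$$\Lambda_\epsilon(\psi) \uparrow I_f(\psi) \quad \text{as } \epsilon \downarrow 0$$

for every $\psi \in \mathcal{C}$. As a result, we can use the monotone convergence theorem again to assert that

$$\int \Lambda_\epsilon(\psi)\, d\tau_0(\psi) \uparrow \int I_f(\psi)\, d\tau_0(\psi) \quad \text{as } \epsilon \downarrow 0.$$

We therefore obtain

$$I_f(\psi_0) = \int I_f(\psi)\, d\tau_0(\psi). \tag{18}$$

Since this is true for all functions $\psi_0$ in $\cap_i \mathcal{C}_1(f_i, D_i)$, we obtain

$$\sup_{\psi \in \cap_i \mathcal{C}_1(f_i, D_i)} I_f(\psi) = \sup_{\psi \in \mathrm{ext}(\cap_i \mathcal{C}_1(f_i, D_i))} I_f(\psi).$$

The proof of the first assertion of Lemma 3.4 is now complete by (13). Similarly, when each divergence $D_{f_i}$ is finite, we can prove that

$$\inf_{\psi \in \cap_i \mathcal{C}_2(f_i, D_i)} I_f(\psi) = \inf_{\psi \in \mathrm{ext}(\cap_i \mathcal{C}_2(f_i, D_i))} I_f(\psi)$$

and this, together with (14), completes the proof of Lemma 3.4. ■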

In the above proof, we used the fact that $\mathcal{C}$ is compact in $C[0, \infty)$, the space of all continuous functions on $[0, \infty)$. We prove this fact below.

Lemma 3.5: The class $\mathcal{C}$ is compact in $C[0, \infty)$ equipped with the topology given by the metric (17).

Proof: We show that $\mathcal{C}$ is sequentially compact. Consider a sequence $\{\psi_n\}$ in $\mathcal{C}$. For every fixed $s_0 \in [0, \infty)$, the sequence $\{\psi_n(s_0)\}$ is a sequence of real numbers in $[0, 1]$ and hence has a convergent subsequence. By a standard diagonalization argument, we assert the existence of a subsequence $\{\psi_{n_k}\}$ of $\{\psi_n\}$ that converges pointwise over the set of all nonnegative rational numbers (denoted by $\mathbb{Q}_+$). For notational simplicity, we relabel this subsequence as $\{\psi_n\}$.

Let us now fix $\epsilon > 0$ and a real number $s_0 \in [0, \infty)$. Choose $r_1, r_2 \in \mathbb{Q}_+$ such that $r_1 \le s_0 \le r_2$ and $|r_1 - r_2| < \epsilon/4$. Also, let $N \ge 1$ be large enough so that

$$|\psi_n(r_i) - \psi_m(r_i)| < \epsilon/4 \quad \text{for } n, m \ge N$$

and for $i = 1, 2$. Using properties of functions in $\mathcal{C}$, we get that

$$|\psi_n(s_0) - \psi_m(s_0)| \le 2|\psi_n(r_1) - \psi_m(r_1)| + 2|r_1 - r_2| < \epsilon.$$

In the first inequality above, we have used the fact that functions in $\mathcal{C}$ are Lipschitz with constant 1 (this can be proved, for instance, using the derivatives given by Lemma 3.2). It therefore follows that the sequence $\{\psi_n\}$ converges pointwise on $[0, \infty)$. The proof is now complete by the observation that, for functions in $\mathcal{C}$, convergence in the metric $\rho$ is equivalent to pointwise convergence on $[0, \infty)$. ■

C. Characterization of Extreme Points 



Lemma 3.4 asserts that for the purposes of finding the supremum or infimum of $I_f$ subject to constraints on $I_{f_i}$, it is enough to focus on the extreme points of the constraint set. In the next theorem, we provide a necessary condition for a function in the constraint set to be an extreme point of the constraint set.

Theorem 3.6: Let $\psi$ be a function in $\cap_i \mathcal{C}_1(f_i, D_i)$ and let $k$ be the number of indices $i$ for which $I_{f_i}(\psi) = D_i$. Then a necessary condition for $\psi$ to be extreme in $\cap_i \mathcal{C}_1(f_i, D_i)$ is that $\psi$ equals $\psi_{P,Q}$ for two probability measures $P, Q \in \mathcal{P}_{k+2}$. The same conclusion also holds for extreme functions in $\cap_i \mathcal{C}_2(f_i, D_i)$ provided all the involved divergences $D_{f_1}, \dots, D_{f_m}$ are finite.

Remark 3.3: When $m = k = 0$, the sets $\cap_i \mathcal{C}_1(f_i, D_i)$ and $\cap_i \mathcal{C}_2(f_i, D_i)$ can both be taken to be equal to $\mathcal{C}$. As will be clear from the proof, the above theorem is also true in this case, where it states that a necessary condition for a function $\psi$ to be extreme in $\mathcal{C}$ is that $\psi$ equals $\psi_{P,Q}$ for two probability measures $P, Q \in \mathcal{P}_2$.

Proof of Theorem 3.6: Let $\psi$ be a function in $\mathrm{ext}(\cap_i \mathcal{C}_1(f_i, D_i))$. Since $\psi \in \mathcal{C}$, we can write $\psi(s) = \psi_{P,Q}(s) = \int \min(p, sq)\, d\lambda$ for some probability measures $P$ and $Q$ on a measurable space $\mathcal{X}$ having densities $p$ and $q$ with respect to a common sigma-finite measure $\lambda$. Without loss of generality, we assume that

$$I_{f_i}(\psi) = D_{f_i}(P\|Q) = D_i \quad \text{for } i = 1, \dots, k \tag{19}$$

and

$$I_{f_i}(\psi) = D_{f_i}(P\|Q) < D_i \quad \text{for } i = k + 1, \dots, m. \tag{20}$$

Let $\alpha : \mathcal{X} \to (-1, 1)$ be a function satisfying

$$\int \alpha p\, d\lambda = \int \alpha q\, d\lambda = 0. \tag{21}$$

Note that $(1 + \alpha)p$, $(1 - \alpha)p$, $(1 + \alpha)q$ and $(1 - \alpha)q$ are all probability densities with respect to $\lambda$. Let $P^+, P^-, Q^+, Q^-$ be probability measures having densities $p_+ := (1 + \alpha)p$, $p_- := (1 - \alpha)p$, $q_+ := (1 + \alpha)q$, $q_- := (1 - \alpha)q$ respectively with respect to $\lambda$. Also, let



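$$\psi_+(s) := \psi_{P^+,Q^+}(s) = \int (1 + \alpha) \min(p, sq)\, d\lambda$$

and

$$\psi_-(s) := \psi_{P^-,Q^-}(s) = \int (1 - \alpha) \min(p, sq)\, d\lambda$$

so that $\psi = (\psi_+ + \psi_-)/2$. For every $i = 1, \dots, m$, we observe that

$$I_{f_i}(\psi_+) = D_{f_i}(P^+\|Q^+) = \int_{\{q_+ > 0\}} q_+ f_i\!\left(\frac{p_+}{q_+}\right) d\lambda + f_i'(\infty)\, P^+\{q_+ = 0\}.$$

Writing $(1 + \alpha)p$ and $(1 + \alpha)q$ for $p_+$ and $q_+$ respectively and noting that $1 + \alpha > 0$ because $\alpha$ takes values in $(-1, 1)$, we obtain

$$I_{f_i}(\psi_+) = I_{f_i}(\psi) + \int \alpha r_i\, d\lambda \tag{22}$$

where

$$r_i := q f_i\!\left(\frac{p}{q}\right) 1\{q > 0\} + f_i'(\infty)\, p\, 1\{q = 0\}.$$

It follows similarly that

$$I_{f_i}(\psi_-) = I_{f_i}(\psi) - \int \alpha r_i\, d\lambda. \tag{23}$$

We observe that $\int r_i\, d\lambda \le D_i$ for each $i = 1, \dots, m$, which implies that

$$\int |\alpha r_i|\, d\lambda < \infty \tag{24}$$

for every function $\alpha$ that takes values in $(-1, 1)$ and $i = 1, \dots, m$.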

From (19), (22) and (23), it follows that the two inequalities

$$I_{f_i}(\psi_+) \le D_i \quad \text{and} \quad I_{f_i}(\psi_-) \le D_i \tag{25}$$

will be satisfied for $i = 1, \dots, k$ if and only if

$$\int \alpha r_i\, d\lambda = 0 \quad \text{for } i = 1, \dots, k. \tag{26}$$

Moreover, from (20), (22) and (23), it follows that if $\sup_{x \in \mathcal{X}} |\alpha(x)|$ is sufficiently small, then (25) will be satisfied also for $i = k + 1, \dots, m$. Let us say that $\alpha$ is a good function if it satisfies (21) and (26) and if $\sup_x |\alpha(x)|$ is sufficiently small. We have thus proved that if $\alpha$ is a good function, then both $\psi_+$ and $\psi_-$ belong to $\cap_i \mathcal{C}_1(f_i, D_i)$. Because $\psi$ is extreme and $\psi = (\psi_+ + \psi_-)/2$, we assert that $\psi = \psi_+ = \psi_-$ for every good function $\alpha$. As a result, $\partial^l \psi(s) = \partial^l \psi_+(s)$ for every $s > 0$ and $\partial^r \psi(s) = \partial^r \psi_+(s)$ for every $s \ge 0$. Because of Lemma 3.2 and the relations $p_+ = (1 + \alpha)p$ and $q_+ = (1 + \alpha)q$, we get that

$$\int_{\{p \ge sq\}} \alpha q\, d\lambda = 0 \quad \text{and} \quad \int_{\{p > sq\}} \alpha q\, d\lambda = 0 \tag{27}$$

for every $s > 0$. On the other hand, the equality $s\psi(1/s) = s\psi_+(1/s)$ for every $s > 0$ implies that $\psi_{Q,P}(s) = \psi_{Q^+,P^+}(s)$ which, by taking derivatives and using $\int \alpha p\, d\lambda = 0$, yields

$$\int_{\{p \ge sq\}} \alpha p\, d\lambda = 0 \quad \text{and} \quad \int_{\{p > sq\}} \alpha p\, d\lambda = 0 \tag{28}$$

for every $s > 0$. We have therefore shown that both (27) and (28) hold for every $s > 0$ whenever $\alpha$ is a good function. We now show that for every decreasing sequence $s_1 > \dots > s_{k+2}$ of positive real numbers, the following condition must hold:

$$\min_{1 \le j \le k+3} \left(P(B_j) + Q(B_j)\right) = 0 \tag{29}$$

where $B_1 = \{p \ge q s_1\}$, $B_i = \{q s_i \le p < q s_{i-1}\}$ for $i = 2, \dots, k + 2$, and $B_{k+3} = \{p < q s_{k+2}\}$. The proof would then be completed by Lemma 3.7.

We prove (29) via contradiction. Suppose that the condition (29) does not hold for some $s_1 > \dots > s_{k+2}$. Let $\alpha = \sum_{j=1}^{k+3} \alpha_j 1_{B_j}$ where $\alpha_1, \dots, \alpha_{k+3}$ are real numbers in $(-1, 1)$ and $1_{B_j}$ denotes the indicator function of $B_j$. We claim that for this $\alpha$, the conditions (27) and (28) cannot hold unless $\alpha_1 = \dots = \alpha_{k+3} = 0$. To see this, note that (27) and (28) for $s = s_1$ give $\alpha_1 (P(B_1) + Q(B_1)) = 0$. But since $P(B_1) + Q(B_1)$ is strictly positive (we are assuming that (29) does not hold), it follows that $\alpha_1 = 0$. We now use (27) and (28) for $s = s_2$ to obtain $\alpha_2 = 0$. Continuing this argument, we get that (27) and (28) cannot hold unless $\alpha_1 = \dots = \alpha_{k+3} = 0$. As a result, it follows that $\alpha = \sum_j \alpha_j 1_{B_j}$ is not a good function for every non-zero vector $(\alpha_1, \dots, \alpha_{k+3})$ in $(-1, 1)^{k+3}$.

On the other hand, as can be easily seen by writing down the conditions (21) and (26), for $\alpha = \sum_j \alpha_j 1_{B_j}$ to be a good function, $\max_j |\alpha_j|$ needs to be sufficiently small and the following equalities need to be satisfied:

$$\sum_{j=1}^{k+3} \alpha_j P(B_j) = 0, \qquad \sum_{j=1}^{k+3} \alpha_j Q(B_j) = 0$$

and

$$\sum_{j=1}^{k+3} \alpha_j \int_{B_j} r_i\, d\lambda = 0 \quad \text{for } i = 1, \dots, k.$$

The above represent $k + 2$ linear equalities for the $k + 3$ variables $\alpha_1, \dots, \alpha_{k+3}$. Therefore, a non-zero solution $(\alpha_1, \dots, \alpha_{k+3})$ exists (which, after rescaling, has $\max_j |\alpha_j|$ small) for which $\alpha = \sum_j \alpha_j 1_{B_j}$ becomes a good function. Since this is a contradiction, we have established (29).



By Lemma 3.7 it follows that $\psi$ can be written as $\psi_{P',Q'}$ for two probability measures $P'$ and $Q'$ on $\{1, \dots, k + 2\}$. This proves the first part of the theorem. The case of $\cap_i \mathcal{C}_2(f_i, D_i)$ is very similar. In the above argument, the only place where we used the fact that the constraints in $\cap_i \mathcal{C}_1(f_i, D_i)$ are of the $\le$ form is in asserting (24). In the case of $\cap_i \mathcal{C}_2(f_i, D_i)$, the statement (24) still holds under the assumption that each divergence $D_{f_i}$ is finite. The rest of the proof proceeds exactly as before. ■

Lemma 3.7: Let $P$ and $Q$ be two probability measures on a space $\mathcal{X}$ having densities $p$ and $q$ with respect to $\lambda$. Let $l \ge 1$ be fixed. Suppose that for every decreasing sequence $s_1 > \dots > s_l$ of positive real numbers, the following condition holds:

$$\min_{1 \le j \le l+1} \left(P(B_j) + Q(B_j)\right) = 0$$

where $B_1 = \{p \ge q s_1\}$, $B_i = \{q s_i \le p < q s_{i-1}\}$ for $i = 2, \dots, l$, and $B_{l+1} = \{p < q s_l\}$. Then $\psi_{P,Q}$ can be written as $\psi_{P',Q'}$ for two probability measures $P', Q' \in \mathcal{P}_l$.
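On a finite space, the reduction behind Lemma 3.7 amounts to merging all sample points that share the same likelihood ratio $p/q$: the testing function depends on $(P, Q)$ only through the masses that $P$ and $Q$ assign to the level sets of the ratio. The Python sketch below illustrates this merging with exact rational arithmetic; the function name `merge_by_likelihood_ratio` and the use of `None` as the key for the set $\{q = 0\}$ are assumptions of this illustration.

```python
from fractions import Fraction

def psi(p, q, s):
    """Testing function psi_{P,Q}(s) = sum_j min(p_j, q_j * s)."""
    return sum(min(pj, qj * s) for pj, qj in zip(p, q))

def merge_by_likelihood_ratio(p, q):
    """Merge atoms with a common ratio p_j / q_j into a single atom.

    Atoms with q_j = 0 < p_j share the key None (ratio +infinity);
    atoms with p_j = q_j = 0 carry no mass and are dropped.
    """
    groups = {}
    for pj, qj in zip(p, q):
        if pj == 0 and qj == 0:
            continue
        key = None if qj == 0 else Fraction(pj) / Fraction(qj)
        mass_p, mass_q = groups.get(key, (0, 0))
        groups[key] = (mass_p + pj, mass_q + qj)
    p_merged = [mass_p for mass_p, _ in groups.values()]
    q_merged = [mass_q for _, mass_q in groups.values()]
    return p_merged, q_merged

# Four atoms with likelihood ratios 1/2, 1/2, 1 and 4.
P = [Fraction(1, 8), Fraction(1, 4), Fraction(1, 8), Fraction(1, 2)]
Q = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 8), Fraction(1, 8)]
P_merged, Q_merged = merge_by_likelihood_ratio(P, Q)
```

Here the first two atoms collapse into one, so the pair on four points is replaced by a pair on three points with exactly the same testing function, and hence the same value of every $f$-divergence by Lemma 3.1.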

Proof: Let $\eta$ denote the probability measure $(P + Q)/2$. Let

$$N := \{x \in (0, 1) : x = \eta\{p \ge qs\} \text{ for some } s \in (0, \infty)\}.$$

We claim that $N$ is a finite set having cardinality at most $l - 1$. To see this, suppose, if possible, that there exist points $0 < x_1 < \dots < x_l < 1$ in $N$. Then, we can write $x_i = \eta\{p \ge q s_i\}$ for some $s_1 > \dots > s_l > 0$. But then $\eta(B_1) = x_1$, $\eta(B_i) = x_i - x_{i-1} > 0$ for $i = 2, \dots, l$ and $\eta(B_{l+1}) = 1 - x_l > 0$, which contradicts the condition given in the lemma. Let us therefore assume that the cardinality of $N$ equals $k \le l - 1$ and let $N = \{x_1, \dots, x_k\}$ where $0 < x_1 < \dots < x_k < 1$. Let

$$s_i^* := \sup\{s > 0 : \eta\{p \ge qs\} = x_i\}$$

for $i = 1, \dots, k$. Also let

$$s_{k+1}^* := \sup\{s > 0 : \eta\{p \ge qs\} = 1\}$$

if there exists $s > 0$ with $\eta\{p \ge qs\} = 1$. If there exists no such $s > 0$, we define $s_{k+1}^* = 0$. It is easy to see that $s_1^* \in (0, \infty]$ and $s_{k+1}^* \in [0, \infty)$ while $s_2^*, \dots, s_k^* \in (0, \infty)$. Let us first consider the case when $s_1^* < \infty$ and $s_{k+1}^* > 0$. In this case, for each $i = 1, \dots, k + 1$, there exists a sequence $\{t_n(i)\}$ with $0 < t_n(i) \uparrow s_i^*$ such that $\eta\{p \ge q t_n(i)\} = x_i$ (we take $x_{k+1} = 1$). Because the sets $\{p \ge q t_n(i)\}$ decrease to $\{p \ge q s_i^*\}$ as $n \to \infty$, it follows that $\eta\{p \ge q s_i^*\} = x_i$ for each $i = 1, \dots, k + 1$. Also it is easy to see that

$$\eta\{p > q s_i^*\} = \lim_{s \downarrow s_i^*} \eta\{p \ge qs\} = x_{i-1}$$

for each $i = 1, \dots, k + 1$, where we take $x_0 = 0$. It follows therefore that $\eta\{p = q s_i^*\} = x_i - x_{i-1}$ for $1 \le i \le k + 1$. Because $\sum_{i=1}^{k+1} (x_i - x_{i-1}) = x_{k+1} - x_0 = 1$, it follows that

$$\sum_{i=1}^{k+1} \eta\{p = q s_i^*\} = 1. \tag{30}$$

It can be checked that the above statement is also true in the case when $s_1^* = \infty$ and/or $s_{k+1}^* = 0$, provided we interpret $\{p = q \cdot \infty\} = \{q = 0\}$ and $\{p = q \cdot 0\} = \{p = 0\}$.

The equality (30) is the same as

$$\sum_{i=1}^{k+1} P\{p = q s_i^*\} = 1 \quad \text{and} \quad \sum_{i=1}^{k+1} Q\{p = q s_i^*\} = 1. \tag{31}$$



Let $p_i = P\{p = q s_i^*\}$ and $q_i = Q\{p = q s_i^*\}$ for $i = 1, \dots, k + 1$ so that $P' = (p_1, \dots, p_{k+1})$ and $Q' = (q_1, \dots, q_{k+1})$ are probability measures on $\{1, \dots, k + 1\}$. For each $i = 1, \dots, k + 1$, we have

$$p_i = P\{p = q s_i^*\} = \int_{\{p = q s_i^*\}} p\, d\lambda = s_i^* \int_{\{p = q s_i^*\}} q\, d\lambda = s_i^* q_i$$

where the above statement is to be interpreted as $q_1 = 0$ if $s_1^* = \infty$ and as $p_{k+1} = 0$ if $s_{k+1}^* = 0$. Also

$$\int_{\{p = q s_i^*\}} \min(p, qs)\, d\lambda = \min(s_i^*, s)\, Q\{p = q s_i^*\} = \min(p_i, q_i s)$$

for every $s \ge 0$ and $i = 1, 2, \dots, k + 1$. Therefore,

$$\psi_{P,Q}(s) = \int \min(p, qs)\, d\lambda = \sum_{i=1}^{k+1} \int_{\{p = q s_i^*\}} \min(p, qs)\, d\lambda = \psi_{P',Q'}(s).$$

The proof is complete because $k + 1 \le l$. ■

D. Completion of the Proof

We shall prove (5). The proof of (6) is entirely analogous. Theorem 3.6 states that every function in $\cap_{i=1}^m \mathcal{C}_1(f_i, D_i)$ that is extreme equals $\psi_{P,Q}$ for some $P, Q \in \mathcal{P}_{m+2}$. Therefore, by Lemma 3.4, we get that $A(D_1, \dots, D_m)$ equals

$$\sup\left\{I_f(\psi_{P,Q}) : \psi_{P,Q} \in \cap_{i=1}^m \mathcal{C}_1(f_i, D_i) \text{ and } P, Q \in \mathcal{P}_{m+2}\right\}.$$

Because $I_{f_i}(\psi_{P,Q})$ equals $D_{f_i}(P\|Q)$, the constraint $\psi_{P,Q} \in \cap_i \mathcal{C}_1(f_i, D_i)$ is equivalent to $D_{f_i}(P\|Q) \le D_i$ for all $i = 1, \dots, m$. The proof is therefore complete.



IV. Remarks and Extensions

A. Stronger Version

The proof of Theorem 2.1 actually yields a smaller expression for $A(D_1, \dots, D_m)$ than $A_{m+2}(D_1, \dots, D_m)$ and a larger expression for $B(D_1, \dots, D_m)$ than $B_{m+2}(D_1, \dots, D_m)$. For each subset $J$ of $\{1, \dots, m\}$, let $A^J(D_1, \dots, D_m)$ denote the supremum of $D_f(P\|Q)$ over all probability measures $P, Q \in \mathcal{P}_{k+2}$ (where $k$ is the cardinality of $J$) for which $D_{f_i}(P\|Q) = D_i$ for $i \in J$ and $D_{f_i}(P\|Q) \le D_i$ for $i \notin J$. It is clear that

$$A^J(D_1, \dots, D_m) \le A_{m+2}(D_1, \dots, D_m)$$

for each $J \subseteq \{1, \dots, m\}$. The following is therefore a stronger version of Theorem 2.1:

$$A(D_1, \dots, D_m) = \max_{J \subseteq \{1, \dots, m\}} A^J(D_1, \dots, D_m). \tag{32}$$

An analogous statement also holds for $B(D_1, \dots, D_m)$. Let us now show that our proof of Theorem 2.1 given in Section III-D results in (32). By Theorem 3.6, every function $\psi$ in $\cap_i \mathcal{C}_1(f_i, D_i)$ that is extreme equals $\psi_{P,Q}$ for some $P, Q \in \mathcal{P}_{k+2}$ where $k$ is the number of indices $i$ for which $I_{f_i}(\psi) = D_{f_i}(P\|Q) = D_i$. Therefore, if $J$ denotes these indices, then

$$I_f(\psi) = D_f(P\|Q) \le A^J(D_1, \dots, D_m) \le \max_{J \subseteq \{1, \dots, m\}} A^J(D_1, \dots, D_m)$$

for every $\psi \in \mathrm{ext}(\cap_i \mathcal{C}_1(f_i, D_i))$. The equality (32) therefore follows from Lemma 3.4.

B. Tightness

The conclusion of Theorem 2.1 is tight in the sense that, in general, one cannot reduce the optimization problems to pairs of probability measures on spaces of cardinality strictly smaller than $m + 2$. We shall demonstrate this fact in this section by means of an example. We also explain this fact numerically in Example 6.6.



Consider the problem of maximizing an $f$-divergence subject to an upper bound on the total variation distance. In other words, let

$$A(V) := \sup\{ D_f(P\|Q) : V(P, Q) \le V \}$$

where $D_f$ is an arbitrary $f$-divergence. In this case, Theorem 2.1 asserts that $A(V)$ equals $A_3(V)$ where, as before,

$$A_k(V) := \sup\{ D_f(P\|Q) : P, Q \in \mathcal{P}_k,\ V(P, Q) \le V \}.$$

We shall show below that when $D_f$ is a finite divergence and when $f$ is strictly convex on $(0, \infty)$, the quantity $A_3(V)$ is strictly larger than $A_2(V)$ for all $V \in (0, 1)$.

The quantity $A_3(V) = A(V)$ can be determined precisely. The easiest way is to use Lemma 3.1. Because

$$V(P, Q) = D_{u_1}(P\|Q) = 1 - \psi_{P,Q}(1),$$

the constraint $V(P, Q) \le V$ is equivalent to $\psi_{P,Q}(1) \ge 1 - V$. Therefore, by Lemma 3.1, we get

$$A(V) = \sup\{ I_f(\psi) : \psi \in \mathcal{C} \text{ and } \psi(1) \ge 1 - V \}.$$

It is obvious that the supremum above is achieved for $\psi(s) = (1 - V)\min(1, s)$ which equals $\psi_{P',Q'}$ for $P' = (1 - V, V, 0)$ and $Q' = (1 - V, 0, V)$. Thus

$$A(V) = D_f(P'\|Q') = V\left(f(0) + f'(\infty)\right).$$

In other words, by Remark 3.2, the quantity $A(V)$ equals $V$ times the maximum possible value of the divergence $D_f$.
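As a quick numerical sanity check of the identity $A(V) = V(f(0) + f'(\infty))$, the following minimal sketch evaluates $D_f(P'\|Q')$ for the three-point pair above. The choice of the squared Hellinger function $f(x) = (\sqrt{x} - 1)^2/2$ (for which $f(0) = f'(\infty) = 1/2$), the value $V = 0.3$, and the helper names are illustrative assumptions, not part of the original text.

```python
import math

def f_divergence(p, q, f, f_prime_inf):
    """D_f(P||Q) for finite distributions: sum over q_j > 0 of q_j * f(p_j / q_j),
    plus f'(infinity) times the P-mass sitting where q_j = 0."""
    total = 0.0
    for pj, qj in zip(p, q):
        if qj > 0:
            total += qj * f(pj / qj)
        elif pj > 0:
            total += f_prime_inf * pj
    return total

def hellinger_f(x):
    # Illustrative choice: f(x) = (sqrt(x) - 1)^2 / 2, so f(0) = 1/2 and f'(inf) = 1/2.
    return (math.sqrt(x) - 1.0) ** 2 / 2.0

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

V = 0.3
P_prime = (1 - V, V, 0.0)
Q_prime = (1 - V, 0.0, V)

value = f_divergence(P_prime, Q_prime, hellinger_f, f_prime_inf=0.5)
bound = V * (hellinger_f(0.0) + 0.5)   # V * (f(0) + f'(inf))
print(value, bound, total_variation(P_prime, Q_prime))
```

For this particular $f$ the maximum possible value of the divergence is 1, so the computed value should coincide with $V$ itself.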

Let us now consider the quantity $A_2(V)$. By compactness and the form of the constraint, it follows that there exist two probability measures $P^*$ and $Q^*$ in $\mathcal{P}_2$ with $V(P^*, Q^*) = V$ and $D_f(P^*\|Q^*) = A_2(V)$. We can then, without loss of generality, parametrize $P^*$ and $Q^*$ by $P^* = (p, 1 - p)$ and $Q^* = (p + V, 1 - p - V)$ for some $0 \le p \le 1 - V$. Consider now the probability measures



P = 



22' ^ l2 4'2 4' ^ 



in $\mathcal{P}_3$. If $V \in (0, 1)$, by strict convexity of the function $f$, it is easy to see that

$$D_f(P\|Q) > D_f(P^*\|Q^*) = A_2(V).$$

On the other hand, it is easy to see that $V(P, Q)$ equals $V$ and hence $A_3(V) \ge D_f(P\|Q)$. Therefore, $A_3(V) > A_2(V)$. Thus, Theorem 2.1 is tight in general. However, in some special cases, one can obtain stronger conclusions; see Sections V-B and V-C.
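The strict gap $A_3(V) > A_2(V)$ can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it takes $D_f$ to be the squared Hellinger distance (so that $A_3(V) = V$), and it searches the two-point family $P^* = (p, 1-p)$, $Q^* = (p+V, 1-p-V)$ on a grid, relying on the parametrization used in the argument above rather than on an exhaustive search. The function names and grid size are hypothetical.

```python
import math

def sq_hellinger(p, q):
    # Squared Hellinger distance for f(x) = (sqrt(x) - 1)^2 / 2: 1 - sum sqrt(p_j q_j).
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))

def best_two_point(V, grid=2001):
    """Grid search over P* = (p, 1-p), Q* = (p+V, 1-p-V) with 0 <= p <= 1-V."""
    best = 0.0
    for k in range(grid):
        p = (1 - V) * k / (grid - 1)
        last = max(0.0, 1 - p - V)          # guard against tiny negative rounding
        best = max(best, sq_hellinger((p, 1 - p), (p + V, last)))
    return best

V = 0.3
A2 = best_two_point(V)   # two-point value found on the grid
A3 = V                   # V * (f(0) + f'(inf)) = V for this choice of f
print(A2, A3)
```

On this family the maximum sits at the endpoints of the range of $p$ and equals $1 - \sqrt{1 - V}$, which is strictly below $V$ for every $V \in (0, 1)$.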
C. Finiteness assumption for $B(D_1, \ldots, D_m)$

In order to prove (6), we required that all the divergences $D_{f_1}, \ldots, D_{f_m}$ are finite. The reason is mainly technical and the finiteness assumption was crucially used in the proof of Lemma 3.4. The set $\cap_i \mathcal{C}_2(f_i, D_i)$ will not be closed (in $C[0, \infty)$ equipped with the metric $\rho$) if some of the divergences $D_{f_i}$ were non-finite (closedness of $\cap_i \mathcal{C}_2(f_i, D_i)$ was critical in the application of Choquet's theorem in Lemma 3.4). To illustrate this non-closedness, let us consider $m = 1$ and the set $\mathcal{C}_2(f_1, D_1)$ for some non-finite divergence $D_{f_1}$ and $D_1 > 0$. By (12), because $D_{f_1}$ is non-finite, we have

$$\int_0^\infty \min(1, s)\, d\nu_{f_1}(s) = \infty.$$

The function $\psi_0(s) = \min(1, s)$ clearly does not belong to $\mathcal{C}_2(f_1, D_1)$ because $I_{f_1}(\psi_0) = 0$. But we shall show that $\psi_0$ belongs to the closure of $\mathcal{C}_2(f_1, D_1)$. Indeed, if

$$\psi_n(s) := \left(1 - \frac{1}{n}\right)\min(1, s) \quad \text{for } s \ge 0,$$

then clearly $\psi_n$ converges to $\psi_0$ in the metric $\rho$. Moreover, for each $n$, $\psi_n \in \mathcal{C}$ and

$$I_{f_1}(\psi_n) = \frac{1}{n}\int_0^\infty \min(1, s)\, d\nu_{f_1}(s) = \infty.$$

Thus $\psi_n \in \mathcal{C}_2(f_1, D_1)$ for each $n \ge 1$ which implies that $\psi_0$ belongs to the closure of $\mathcal{C}_2(f_1, D_1)$. Therefore, $\mathcal{C}_2(f_1, D_1)$ is not closed.

The quantity $B(D_1, \ldots, D_m)$ behaves strangely when some of the divergences $D_{f_i}$ are non-finite and when $D_f$ is finite. Indeed, in this case, one can simply drop the constraints corresponding to the non-finite divergences and reduce the problem to the case when all divergences are finite. This is the content of the next lemma.

Lemma 4.1: Let $D_f, D_{f_1}, \ldots, D_{f_m}$ be finite divergences and let $D_{f_{m+1}}, \ldots, D_{f_{m+l}}$ be non-finite divergences. Then

$$B(D_1, \ldots, D_{m+l}) = B(D_1, \ldots, D_m).$$

Proof: We shall work with (14). Because $\cap_{i=1}^{m+l} \mathcal{C}_2(f_i, D_i)$ is contained in $\cap_{i=1}^{m} \mathcal{C}_2(f_i, D_i)$, it follows that $B(D_1, \ldots, D_{m+l})$ is larger than or equal to $B(D_1, \ldots, D_m)$. To prove the other inequality, let $\psi \in \cap_{i=1}^{m} \mathcal{C}_2(f_i, D_i)$. For each $n \ge 1$, define

$$\psi_n(s) := \min\left\{ \left(1 - \frac{1}{n}\right)\min(1, s),\ \psi(s) \right\}.$$

It is easy to check that $\psi_n \in \mathcal{C}$. Note that for $1 \le i \le m$,

$$I_{f_i}(\psi_n) = \int_0^\infty \left(\min(1, s) - \psi_n(s)\right) d\nu_{f_i}(s) \ge \int_0^\infty \left(\min(1, s) - \psi(s)\right) d\nu_{f_i}(s) = I_{f_i}(\psi) \ge D_i.$$

Moreover, for $m < i \le m + l$, we have

$$I_{f_i}(\psi_n) = \int_0^\infty \left(\min(1, s) - \psi_n(s)\right) d\nu_{f_i}(s) \ge \frac{1}{n}\int_0^\infty \min(1, s)\, d\nu_{f_i}(s) = \infty \ge D_i.$$

It therefore follows that $\psi_n \in \cap_{i=1}^{m+l} \mathcal{C}_2(f_i, D_i)$ for every $n \ge 1$. Consequently,

$$I_f(\psi_n) \ge B(D_1, \ldots, D_{m+l}) \quad \text{for every } n \ge 1.$$

Observe that $\psi_n(s)$ converges to $\psi(s)$ for every $s \ge 0$. Thus, because $D_f$ is a finite divergence, it follows by the dominated convergence theorem that $I_f(\psi_n)$ converges to $I_f(\psi)$ which results in

$$I_f(\psi) \ge B(D_1, \ldots, D_{m+l}).$$

Finally, because $\psi \in \cap_{i=1}^{m} \mathcal{C}_2(f_i, D_i)$ is arbitrary, we have proved that $B(D_1, \ldots, D_m)$ is larger than or equal to $B(D_1, \ldots, D_{m+l})$ which completes the proof of the lemma. ■



Remark 4.1: If $D_f$ is finite and if all the divergences $D_{f_1}, \ldots, D_{f_m}$ are non-finite, then Lemma 4.1 gives that

$$B(D_1, \ldots, D_m) = 0 \tag{33}$$

for all values of $D_1, \ldots, D_m$. Here is a special instance of this result. Suppose that $D_f$ denotes the total variation distance, $m = 1$ and that $D_{f_1}$ is the Kullback-Leibler divergence. Then (33) shows that the smallest value of the total variation distance over all probability measures with Kullback-Leibler divergence at least 5 (say) equals 0. The same conclusion holds for multiple non-finite divergence constraints as well.
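The special instance in Remark 4.1 can be made concrete with an explicit family of two-point measures. The sketch below is an illustration only: the family $P = (\epsilon, 1-\epsilon)$, $Q = (\delta, 1-\delta)$ with $\delta = \epsilon\, e^{-6/\epsilon}$ is an assumed construction (it does not appear in the original text), chosen so that the Kullback-Leibler divergence stays above 5 while the total variation distance is at most $\epsilon$.

```python
import math

def two_point_example(eps, kl_target=5.0):
    """P = (eps, 1-eps), Q = (delta, 1-delta) with log(eps/delta) = (kl_target + 1)/eps.
    Returns (total variation, Kullback-Leibler divergence of P from Q)."""
    log_ratio = (kl_target + 1.0) / eps       # log(eps / delta)
    delta = eps * math.exp(-log_ratio)        # may underflow to 0.0; only used additively
    kl = eps * log_ratio + (1 - eps) * math.log((1 - eps) / (1 - delta))
    tv = eps - delta
    return tv, kl

results = [two_point_example(eps) for eps in (0.1, 0.01, 0.001)]
for tv, kl in results:
    print(f"TV = {tv:.4g}, KL = {kl:.4f}")
```

As $\epsilon$ decreases, the total variation distance shrinks toward 0 while the divergence constraint remains satisfied, in line with (33).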
Theorem 2.1 gives a formula for $B(D_1, \ldots, D_m)$ for arbitrary $D_f$ and for finite $D_{f_1}, \ldots, D_{f_m}$. In Lemma 4.1 we showed that when $D_f$ is finite, then the case when one or more of $D_{f_1}, \ldots, D_{f_m}$ are non-finite can be reduced to the case where all the constraint divergences are finite which is handled by Theorem 2.1. The case that we are unable to resolve is $B(D_1, \ldots, D_m)$ when $D_f$ is non-finite and when one or more of $D_{f_1}, \ldots, D_{f_m}$ are non-finite. This case is neither covered by Theorem 2.1 nor by Lemma 4.1.

D. Sufficiency of the extreme point characterization 

In Theorem 3.6 we gave a necessary condition for functions in the classes $\cap_i \mathcal{C}_1(f_i, D_i)$ and $\cap_i \mathcal{C}_2(f_i, D_i)$ to be extreme. As we have seen, this necessary condition was enough to prove Theorem 2.1. For the sake of completeness, in this section, we investigate whether the condition in Theorem 3.6 is sufficient as well for extremity.

Let $j \in \{1, 2\}$ and let $\psi$ be a function in $\cap_i \mathcal{C}_j(f_i, D_i)$. Suppose $\psi$ satisfies the condition given in Theorem 3.6, i.e., let $\psi = \psi_{P,Q}$ for two probability measures $P, Q \in \mathcal{P}_{k+2}$ where $k$ is the number of indices where $I_{f_i}(\psi) = D_i$. Here, we explore the question of extremity of $\psi$ in $\cap_i \mathcal{C}_j(f_i, D_i)$.

Let $l \le k + 2$ be the size of the (finite) support set of the measure $P + Q$ and let $P = (p_1, \ldots, p_l)$ and $Q = (q_1, \ldots, q_l)$; then $\psi(s) = \sum_{i=1}^{l} \min(p_i, q_i s)$. Because the size of the support set of $P + Q$ is $l$, it follows that $\max(p_i, q_i) > 0$ for every $i$. It is easy to check that $\psi$ is piecewise linear with knots at $p_i/q_i$ (this ratio can equal $0$ or $\infty$ as well).

Suppose that $\psi = (\psi_1 + \psi_2)/2$ for two functions $\psi_1$ and $\psi_2$ in $\cap_i \mathcal{C}_j(f_i, D_i)$. Because $\psi_1$ and $\psi_2$ are both concave, it follows that they both have to be linear in the regions where $\psi$ is linear. As a result, one can write

$$\psi_1(s) = \sum_{i=1}^{l} (1 + \alpha_i)\min(p_i, q_i s) \quad \text{and} \quad \psi_2(s) = \sum_{i=1}^{l} (1 - \alpha_i)\min(p_i, q_i s)$$

for some $\alpha_1, \ldots, \alpha_l \in [-1, 1]$ satisfying

$$\sum_{i=1}^{l} \alpha_i p_i = \sum_{i=1}^{l} \alpha_i q_i = 0. \tag{34}$$

Now, whenever $I_{f_i}(\psi) = D_i$, because of the above, we must have $I_{f_i}(\psi_1) = D_i$. This latter equality can be written as a linear equality in $\alpha_1, \ldots, \alpha_l$. Because $I_{f_i}(\psi) = D_i$ for $k$ indices $i$, we obtain $k$ linear equations for $\alpha_1, \ldots, \alpha_l$. These, together with (34), give rise to $k + 2$ linear equations for the $l \le k + 2$ variables $\alpha_1, \ldots, \alpha_l$. Under appropriate linear independence conditions on the measures $\nu_{f_i}$, these would imply that $\alpha_i = 0$ for every $1 \le i \le l$ which would further imply that $\psi_1 = \psi = \psi_2$ and that $\psi$ is extreme.

In the case when $m \le 1$ however, no such explicit linear independence conditions are necessary and, moreover, one can also give a geometric proof of the sufficiency characterization of the extreme points. We do this below in two parts: Lemma 4.2 deals with $m = 0$ (i.e., extreme points of $\mathcal{C}$) and Lemma 4.3 deals with the $m = 1$ case.

Lemma 4.2: For every $P, Q \in \mathcal{P}_2$, the function $\psi_{P,Q}$ is extreme in $\mathcal{C}$.

Proof: Fix two probability measures $P$ and $Q$ on $\{1, 2\}$ and let $J$ denote the smallest open interval (possibly infinite) such that $\psi_{P,Q}(s) = \min(1, s)$ for $s \notin J$. By explicitly writing down the expression for $\psi_{P,Q}$ in terms of $P\{1\}$ and $Q\{1\}$, it is easy to see that if $J$ is non-empty, then $\psi_{P,Q}$ is linear on $J$.

Suppose now that $\psi_{P,Q}$ equals the convex combination $(\psi_1 + \psi_2)/2$ for two functions $\psi_1$ and $\psi_2$ in $\mathcal{C}$. If $J$ is empty, then $\psi_{P,Q}$ equals the function $\min(1, s)$ for all $s$ and since all functions in $\mathcal{C}$ are bounded from above by $\min(1, s)$, it follows that

$$\psi_{P,Q}(s) = \psi_1(s) = \psi_2(s) = \min(1, s) \tag{35}$$

for all $s \ge 0$.

Let us therefore assume that $J$ is non-empty. In this case, again it is obvious that (35) holds for $s \notin J$. Concavity of functions in $\mathcal{C}$ and linearity of $\psi_{P,Q}$ in $J$ would then imply that $\psi_1 \ge \psi_{P,Q}$ and $\psi_2 \ge \psi_{P,Q}$. Since $\psi_{P,Q}$ is the average of $\psi_1$ and $\psi_2$, this can happen only when $\psi_{P,Q} = \psi_1 = \psi_2$. The proof is complete. ■

Lemma 4.3: Let $j \in \{1, 2\}$ and consider the class $\mathcal{C}_j(f_1, D_1)$ for $D_1 > 0$. For every $P, Q \in \mathcal{P}_3$ with $D_{f_1}(P\|Q) = D_1$, the function $\psi_{P,Q}$ is extreme in $\mathcal{C}_j(f_1, D_1)$.

Proof: Fix two probability measures $P$ and $Q$ in $\mathcal{P}_3$ with $D_{f_1}(P\|Q) = D_1$ so that $I_{f_1}(\psi_{P,Q}) = D_1$. For notational convenience, let us denote $\psi_{P,Q}$ by $\psi$. As in the proof of Lemma 4.2, let $J$ denote the smallest interval outside which $\psi(s)$ equals $\min(1, s)$. If $J$ is empty, then $\psi$ equals the function $\min(1, s)$ which is obviously extreme. So let us assume that $J$ is non-empty. In that case, because $P, Q \in \mathcal{P}_3$, it can be checked that $\psi$ is piecewise linear with at most two segments in $J$.

Suppose that $\psi = (\psi_1 + \psi_2)/2$ for two functions $\psi_1, \psi_2 \in \mathcal{C}_j(f_1, D_1)$. Because $I_{f_1}(\psi) = D_{f_1}(P\|Q) = D_1$, it follows that

$$I_{f_1}(\psi_1) = I_{f_1}(\psi_2) = I_{f_1}(\psi) = D_{f_1}(P\|Q) = D_1. \tag{36}$$

If $\psi$ has exactly one segment in $J$, then, by concavity, the inequalities $\psi_1(s) \ge \psi(s)$ and $\psi_2(s) \ge \psi(s)$ hold for all $s$. Because $\psi_1$ and $\psi_2$ average out to $\psi$, we must then have $\psi = \psi_1 = \psi_2$.

Now suppose that $\psi$ has exactly two segments in $J$. Let $a$ be the point in $J$ such that $\psi$ is linear on both $J \cap [0, a]$ and $J \cap [a, \infty)$. We shall show that $\psi(a) = \psi_1(a) = \psi_2(a)$. Concavity of $\psi_1$ and $\psi_2$ and linearity of $\psi$ on $J \cap [0, a]$ and $J \cap [a, \infty)$ can then be used to show that $\psi = \psi_1 = \psi_2$. Suppose, if possible, that $\psi_1(a) > \psi(a)$. Using the concavity of $\psi_1$, it then follows that $\psi_1(s) > \psi(s)$ for all $s \in J$. Because of (36), it follows that

$$\int_0^\infty \left(\psi_1(s) - \psi(s)\right) d\nu_{f_1}(s) = \int_J \left(\psi_1(s) - \psi(s)\right) d\nu_{f_1}(s) = 0.$$

This implies that $\nu_{f_1}(J) = 0$. But then

$$D_1 = I_{f_1}(\psi) = \int_J \left(\min(1, s) - \psi(s)\right) d\nu_{f_1}(s) = 0$$

which contradicts the fact that $D_1 > 0$. We have thus obtained that $\psi_1(a) \le \psi(a)$. Similarly, $\psi_2(a) \le \psi(a)$ and since $\psi(a)$ is an average of $\psi_1(a)$ and $\psi_2(a)$, it follows that $\psi(a) = \psi_1(a) = \psi_2(a)$. The proof is complete. ■

V. Applications and Special Cases

A. Joint Range of f-divergences

The special case of Theorem 3.6 for $m = k = 0$ (see Remark 3.3) and Lemma 4.2 can both be combined to yield the following result.

Theorem 5.1: A necessary and sufficient condition for a function $\psi$ in $\mathcal{C}$ to be extreme is that $\psi$ equals $\psi_{P,Q}$ for two probability measures $P, Q \in \mathcal{P}_2$.

The goal of this section is to describe how Theorem 5.1 can be used to determine the joint range of $f$-divergences. Let $\mathcal{C}_2$ denote the class of all functions $\psi_{P,Q}$ for $P, Q \in \mathcal{P}_2$. Theorem 5.1 states that $\mathrm{ext}(\mathcal{C})$ equals $\mathcal{C}_2$. Just like the proof of (15), it can be shown that for every $\psi_0 \in \mathcal{C}$ there exists a Borel probability measure $\tau_0$ that is concentrated on $\mathcal{C}_2$ such that

$$I_f(\psi_0) = \int_{\mathcal{C}_2} I_f(\psi)\, d\tau_0(\psi)$$

for every divergence $D_f$. Also, for every $m \ge 1$ and divergences $D_{f_1}, \ldots, D_{f_m}$, we get that

$$\mathbf{I}(\psi_0) = \int_{\mathcal{C}_2} \mathbf{I}(\psi)\, d\tau_0(\psi) \tag{37}$$

where $\mathbf{I}(\psi) := (I_{f_1}(\psi), \ldots, I_{f_m}(\psi))$ for $\psi \in \mathcal{C}$. Because $\tau_0$ is a probability measure, it follows that the set $\{\mathbf{I}(\psi) : \psi \in \mathcal{C}\}$ equals the convex hull of the set $\{\mathbf{I}(\psi) : \psi \in \mathcal{C}_2\}$. By Lemma 3.1, the set $\{\mathbf{I}(\psi) : \psi \in \mathcal{C}\}$ equals the joint range of the divergences $D_{f_1}, \ldots, D_{f_m}$, denoted by $\mathcal{R}(f_1, \ldots, f_m)$ and defined as the set of all vectors in $\mathbb{R}^m$ that equal $(D_{f_1}(P\|Q), \ldots, D_{f_m}(P\|Q))$ for some pair of probability measures $P$ and $Q$. The identity (37) shows that the joint range $\mathcal{R}(f_1, \ldots, f_m)$ equals the convex hull of $\mathcal{R}_2(f_1, \ldots, f_m)$ where $\mathcal{R}_2(f_1, \ldots, f_m)$ is defined as

$$\left\{ \left(D_{f_1}(P\|Q), \ldots, D_{f_m}(P\|Q)\right) : P, Q \in \mathcal{P}_2 \right\}.$$

In other words, for every $m \ge 1$ and divergences $D_{f_1}, \ldots, D_{f_m}$, their joint range equals the convex hull of the joint range corresponding to two point probability measures. This result was proved in [31] for $m = 2$ where a method based on elementary calculus is also provided for calculating $\mathcal{R}_2(f_1, f_2)$ for some special choices of $f_1$ and $f_2$.

There exist connections between joint ranges of $f$-divergences and the quantities $A(D_1, \ldots, D_m)$ and $B(D_1, \ldots, D_m)$. Indeed, given divergences $D_f$ and $D_{f_i}$ for $i = 1, \ldots, m$, knowledge of the joint range $\mathcal{R}(f, f_1, \ldots, f_m)$ should, in principle, allow us to determine both $A(D_1, \ldots, D_m)$ and $B(D_1, \ldots, D_m)$. Because the joint range $\mathcal{R}(f, f_1, \ldots, f_m)$ is determined by pairs of probability measures on $\{1, 2\}$, it follows that both $A(D_1, \ldots, D_m)$ and $B(D_1, \ldots, D_m)$ can also be completely determined by probabilities on $\{1, 2\}$. But computing $A(D_1, \ldots, D_m)$ and $B(D_1, \ldots, D_m)$ this way would require considering even those probability measures $P$ and $Q$ on $\{1, 2\}$ for which the constraints $D_{f_i}(P\|Q) \le D_i$ do not all hold. On the other hand, if one would just restrict attention to those pairs of probability measures for which the constraints $D_{f_i}(P\|Q) \le D_i$ hold, then, in general, it is necessary to consider pairs of probability measures on the larger set $\{1, \ldots, m + 2\}$. This is because Theorem 2.1 is in general tight as we have demonstrated in Section IV-B (and also in Example 6.6).



B. Primitive Divergences 

In this section, we consider the case of the quantity $B(D_1, \ldots, D_m)$ where all the divergences $D_{f_1}, \ldots, D_{f_m}$ are primitive divergences (see Remark 3.1). In Theorem 5.2 below, we show that, in this case, $B(D_1, \ldots, D_m)$ actually equals $B_{m+1}(D_1, \ldots, D_m)$ as opposed to $B_{m+2}(D_1, \ldots, D_m)$.

The problem of minimizing an /-divergence subject to 
constraints on primitive divergences and the related problem of 
obtaining inequalities between /-divergences and primitive di- 
vergences has received much attention in the literature and has 
a long history. Let us briefly mention some important works 
in this area. The most well-known such inequality is Pinsker's inequality which states that $D_{KL}(P\|Q) \ge 2V^2(P, Q)$ where $D_{KL}$ is the Kullback-Leibler divergence which corresponds to $f(x) = x \log x$ and $V$ is the total variation distance. Pinsker [18] proved this inequality with the constant 2 replaced by 1. The inequality with the constant 2 (which cannot be
improved further) has been proved independently almost at the 
same time by Csiszar |2j, Kemperman and Kullback p9) . 



Although Pinsker's inequality is very useful, it is not sharp in the sense that

$$\inf\{ D_{KL}(P\|Q) : V(P, Q) \ge V \} > 2V^2$$

for every $V > 0$. The problem of finding sharp inequalities between $D_{KL}(P\|Q)$ and $V(P, Q)$ was solved in [23] where an implicit expression for the infimum in the left hand side above was provided.

The more general problem of finding the best lower bound for an arbitrary $f$-divergence given a lower bound on total variation distance was solved by Gilardoni in [24]. The problem of finding lower bounds for $f$-divergences given constraints on a finite number of primitive divergences was studied by [25]. In Remark 5.1 we explain how our theorem below gives an equivalent but simpler solution compared to the solution of [25].

Theorem 5.2: Suppose that $D_f$ is an arbitrary divergence and that all divergences $D_{f_1}, \ldots, D_{f_m}$ are primitive divergences. Then

$$B(D_1, \ldots, D_m) = B_{m+1}(D_1, \ldots, D_m).$$

Proof: Theorem 2.1 states that $B(D_1, \ldots, D_m)$ equals $B_{m+2}(D_1, \ldots, D_m)$. We shall show therefore that $B_{m+2}(D_1, \ldots, D_m)$ equals $B_{m+1}(D_1, \ldots, D_m)$. It is obvious that

$$B_{m+2}(D_1, \ldots, D_m) \le B_{m+1}(D_1, \ldots, D_m)$$

because we have a minimization problem and the constraint set is larger in the case of $B_{m+2}(D_1, \ldots, D_m)$. It is therefore enough to prove that

$$B_{m+2}(D_1, \ldots, D_m) \ge B_{m+1}(D_1, \ldots, D_m).$$

Fix two probability measures $P = (p_1, \ldots, p_{m+2})$ and $Q = (q_1, \ldots, q_{m+2})$ in $\mathcal{P}_{m+2}$ with $D_{f_i}(P\|Q) \ge D_i$ for every $i = 1, \ldots, m$. We show below that

$$D_f(P\|Q) \ge B_{m+1}(D_1, \ldots, D_m)$$

which will complete the proof.

Without loss of generality, we assume that $p_i + q_i > 0$ for each $i$ and that the likelihood ratios $r_i := p_i/q_i \in [0, \infty]$ satisfy $r_1 \le \cdots \le r_{m+2}$. Because each divergence $D_{f_i}$ is assumed to be primitive, the convex function $f_i$ is piecewise linear with exactly two linear parts. As a result, there exists some index $j \in \{1, \ldots, m + 1\}$ such that all the functions $f_1, \ldots, f_m$ are linear in the interval $[r_j, r_{j+1}]$.

Now consider the two probability measures $P^*$ and $Q^*$ in $\mathcal{P}_{m+1}$ defined by

$$P^* := (p_1, \ldots, p_{j-1}, p_j + p_{j+1}, p_{j+2}, \ldots, p_{m+2})$$

and

$$Q^* := (q_1, \ldots, q_{j-1}, q_j + q_{j+1}, q_{j+2}, \ldots, q_{m+2}).$$

Because of the linearity of $f_1, \ldots, f_m$ on $[r_j, r_{j+1}]$, it is easy to check that

$$D_{f_i}(P^*\|Q^*) = D_{f_i}(P\|Q) \ge D_i \quad \text{for all } 1 \le i \le m.$$

As a result, we have

$$D_f(P^*\|Q^*) \ge B_{m+1}(D_1, \ldots, D_m).$$

On the other hand, by convexity or as a consequence of the data processing inequality for $f$-divergences (see, for example, [17, Lemma 4.1]), it follows that

$$D_f(P\|Q) \ge D_f(P^*\|Q^*) \ge B_{m+1}(D_1, \ldots, D_m).$$

The proof is complete. ■
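The merging step in the proof can be checked on a small example. The sketch below is illustrative only: it assumes $m = 1$ with the total variation distance as the primitive divergence (kink at likelihood ratio 1) and the Kullback-Leibler divergence as $D_f$, and the particular three-point measures and helper names are hypothetical. Two atoms whose likelihood ratios lie on the same side of the kink are merged, and the two claims of the proof (the primitive divergence is preserved, $D_f$ does not increase) are evaluated.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence, f(x) = x log x; assumes q_j > 0 wherever p_j > 0.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def merge_atoms(p, q, j):
    """Merge atoms j and j+1 (0-based) of both measures into a single atom."""
    p_star = p[:j] + (p[j] + p[j + 1],) + p[j + 2:]
    q_star = q[:j] + (q[j] + q[j + 1],) + q[j + 2:]
    return p_star, q_star

# Likelihood ratios p_j / q_j are 0.25, 0.75, 3.0 (already sorted);
# the first two lie on the same linear piece of the total variation function.
P = (0.1, 0.3, 0.6)
Q = (0.4, 0.4, 0.2)
P_star, Q_star = merge_atoms(P, Q, 0)

tv_before, tv_after = total_variation(P, Q), total_variation(P_star, Q_star)
kl_before, kl_after = kl(P, Q), kl(P_star, Q_star)
print(tv_before, tv_after, kl_before, kl_after)
```

The constraint value is unchanged by the merge while the objective can only decrease, which is why one fewer atom suffices for the minimization problem.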
Remark 5.1: Let $0 < s_1 < \cdots < s_m < \infty$ and let $D_{f_i}$ be the primitive divergence corresponding to $f_i = u_{s_i}$ (the functions $u_s$ are defined in Remark 3.1). Then the optimization problem corresponding to $B_{m+1}(D_1, \ldots, D_m)$ can be written as:

$$\begin{aligned}
\text{minimize} \quad & \sum_{j : q_j > 0} q_j f\left(\frac{p_j}{q_j}\right) + f'(\infty) \sum_{j : q_j = 0} p_j \\
\text{subject to} \quad & p_j \ge 0,\ q_j \ge 0 \text{ for all } j = 1, \ldots, m + 1 \\
& \sum_{j=1}^{m+1} p_j = \sum_{j=1}^{m+1} q_j = 1 \\
& \sum_{j=1}^{m+1} \min(p_j, q_j s_i) \le \min(1, s_i) - D_i
\end{aligned} \tag{38}$$

for $i = 1, \ldots, m$. According to Theorem 5.2, the optimal value of this problem equals $B(D_1, \ldots, D_m)$. As we mentioned before, the problem of determining $B(D_1, \ldots, D_m)$ when the divergences $D_{f_i}$ are all primitive divergences has been studied by [25]. Their main result [25, Theorem 6] gives a characterization of $B(D_1, \ldots, D_m)$ that is much more complicated than (38). However, the two forms are essentially equivalent.



To understand the equivalence, observe that, by Lemma 3.1, $D_f(P\|Q)$ can be written as an integral functional of $\psi_{P,Q}$. It is possible to precisely characterize the form of the function $\psi_{P,Q}$ when $P, Q \in \mathcal{P}_{m+1}$. As a result, the optimization problem (38) can be reformulated in terms of such concave functions $\psi$. This, after some tedious algebra, leads to the formula for $B(D_1, \ldots, D_m)$ given in [25, Theorem 6]. Our formula (38) is much simpler and, moreover, is conceptually easier to understand.

The special case of $m = 1$ in Theorem 5.2 asserts that in order to determine $B(D)$ when $D_{f_1}$ is a primitive divergence, one only needs to consider probabilities on $\{1, 2\}$. This fact is well-known at least in the case when $D_{f_1}$ is the total variation distance (see, for example, [24, Proposition 2.1]). It is then possible to give a more direct expression for $B(D)$ which is the content of the following lemma, whose special case for $s = 1$ appears in [24, Proposition 2.1].

Lemma 5.3: Let $m = 1$ and consider the quantity $B(D)$ where $D_f$ is an arbitrary $f$-divergence and $D_{f_1}$ is the primitive divergence corresponding to $f_1 = u_s$ for a fixed $s > 0$. Then, for every $0 \le D < \min(1, s)$, the quantity $B(D)$ equals

$$\inf_{0 \le q \le H/s} \left[ (1 - q) f\left(\frac{H - qs}{1 - q}\right) + q f\left(\frac{1 + qs - H}{q}\right) \right] \tag{39}$$

where $H := \min(1, s) - D$.

Proof: We shall now show that $B_2(D)$ equals (39). Note that $B_2(0) = 0$ and (39) also equals $0$ when $D = 0$. To see this, note that it is trivially zero (because $f(1) = 0$) when $s = 1$ and when $s \ne 1$, then it is zero because the value at $q = (1 - \min(1, s))/(1 - s)$ equals $0$. So we shall assume below that $D > 0$. The optimization problem corresponding to $B_2(D)$ is:

$$\begin{aligned}
\underset{p, q \in [0,1]^2}{\text{minimize}} \quad & \sum_{j : q_j > 0} q_j f\left(\frac{p_j}{q_j}\right) + f'(\infty) \sum_{j : q_j = 0} p_j \\
\text{subject to} \quad & p_j \ge 0,\ q_j \ge 0 \text{ for } j = 1, 2 \\
& p_1 + p_2 = q_1 + q_2 = 1 \\
& \min(p_1, q_1 s) + \min(p_2, q_2 s) = H.
\end{aligned} \tag{40}$$

Note that we have equality as opposed to $\le$ in the last constraint above. This is because of the fact that for every $(p_1, p_2)$ and $(q_1, q_2)$ lying in the constraint set for which the last constraint is not tight, we can get $(p_1, p_2)$ and $(q_1, q_2)$ still lying in the constraint set with the last constraint satisfied with an equality sign and for which the objective function is reduced.

We will now finish the proof by showing that the optimal value of the optimization problem (40) is (39). Let $(p_1, p_2)$ and $(q_1, q_2)$ satisfy the constraint set with $p_1/q_1 \le 1 \le p_2/q_2$. If $s \notin [p_1/q_1, p_2/q_2]$, then clearly $\min(p_1, q_1 s) + \min(p_2, q_2 s) = \min(1, s)$ and such $(p_1, p_2)$ and $(q_1, q_2)$ do not satisfy the constraint set because $D > 0$. So we assume that $s \in [p_1/q_1, p_2/q_2]$. In this case, the final constraint gives $p_1 = H - q_2 s$. We can therefore write each of $p_1$, $p_2$ and $q_1$ in terms of $q_2$. Plugging these values in the objective function leads to the function in (39) (with $q$ replaced by $q_2$). The fact that each of $p_1$, $p_2$, $q_1$ and $q_2$ need to lie between $0$ and $1$ gives the constraint $0 \le q_2 \le H/s$. The proof is complete. ■

For completeness, let us note the special case of the above lemma in the case of the total variation distance, which corresponds to $s = 1$. This result is due to Gilardoni [24, Proposition 2.1].

Corollary 5.4 (Gilardoni): Let $m = 1$ and consider the quantity $B(V)$ where $D_f$ is an arbitrary $f$-divergence and $D_{f_1}(P\|Q)$ equals $V(P, Q)$, the total variation distance between $P$ and $Q$. Then, for every $0 \le V < 1$,

$$B(V) = \inf\{ T(q, V) : 0 \le q \le 1 - V \} \tag{41}$$

where

$$T(q, V) := (1 - q) f\left(\frac{1 - V - q}{1 - q}\right) + q f\left(\frac{q + V}{q}\right).$$

Consequently, for every pair of probability measures $P$ and $Q$, we have the inequality

$$D_f(P\|Q) \ge \inf\{ T(q, V(P, Q)) : 0 \le q \le 1 - V(P, Q) \}. \tag{42}$$

Moreover, this represents the sharpest possible inequality between $D_f$ and total variation distance.
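The one-dimensional infimum in (41) is easy to approximate numerically. The following minimal sketch does this by a plain grid search for the Kullback-Leibler divergence, $f(x) = x \log x$, and compares the result with Pinsker's bound $2V^2$. The grid size, the function names, and the choice to skip the endpoint $q = 0$ (where the second term is infinite for this $f$) are assumptions made for illustration; this is not the numerical method used later in the paper.

```python
import math

def xlogx(x):
    # f(x) = x log x with the convention f(0) = 0.
    return x * math.log(x) if x > 0 else 0.0

def T(q, V, f):
    """T(q, V) = (1-q) f((1-V-q)/(1-q)) + q f((q+V)/q), for 0 < q <= 1 - V, V > 0."""
    num = max(0.0, 1 - V - q)               # guard against tiny negative rounding
    return (1 - q) * f(num / (1 - q)) + q * f((q + V) / q)

def B_grid(V, f, grid=20000):
    """Approximate inf over q in (0, 1-V] of T(q, V) on a uniform grid."""
    return min(T((1 - V) * k / grid, V, f) for k in range(1, grid + 1))

V = 0.5
b = B_grid(V, xlogx)
pinsker = 2 * V ** 2
print(b, pinsker)
```

Every grid value is the divergence of an actual two-point pair with total variation $V$, so the approximation is an upper bound on $B(V)$ and, by Pinsker's inequality, cannot fall below $2V^2$.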

Although the expression (41) cannot be simplified further in general, one can get much simpler expressions for $B(V)$ in certain special cases. One such special case of interest corresponds to symmetric $f$-divergences. An $f$-divergence is said to be symmetric if the underlying convex function $f$ satisfies the identity: $f(x) = x f(1/x)$ for all $x \in (0, \infty)$. It is easy to check that under this condition, one has $D_f(P\|Q) = D_f(Q\|P)$ for all $P$ and $Q$. Examples of symmetric divergences include the total variation distance, squared Hellinger distance, triangular discrimination and the Jensen-Shannon divergence. The following result is due to Gilardoni [24]. We include it here for completeness and also because our proof is more direct than that in [24].

Corollary 5.5 (Gilardoni): Let $m = 1$ and consider the quantity $B(V)$ where $D_f$ is a symmetric $f$-divergence and $D_{f_1}(P\|Q)$ equals $V(P, Q)$, the total variation distance between $P$ and $Q$. Then, for every $0 \le V < 1$,

$$B(V) = (1 - V) f\left(\frac{1 + V}{1 - V}\right). \tag{43}$$

Consequently, for every pair of probability measures $P$ and $Q$, we have

$$D_f(P\|Q) \ge (1 - V(P, Q)) f\left(\frac{1 + V(P, Q)}{1 - V(P, Q)}\right). \tag{44}$$

Moreover, this represents the sharpest possible inequality between the symmetric divergence $D_f$ and total variation distance.

Proof: We shall show that the right hand side of (41) equals the right hand side of (43) when $D_f$ is a symmetric divergence. Consider the quantity $T(q, V)$ defined in Corollary 5.4. Because $f(x) = x f(1/x)$, it can be easily checked that

$$T(q, V) = T(1 - q - V, V) \quad \text{for all } q \in [0, 1 - V].$$

In other words, the function $q \mapsto T(q, V)$ is symmetric in the interval $[0, 1 - V]$ about the mid-point $(1 - V)/2$. Moreover, as can be checked by taking derivatives (one-sided derivatives if $f$ is not differentiable), $q \mapsto T(q, V)$ is convex on $[0, 1 - V]$ (this fact does not require $f$ to be symmetric). These two facts clearly imply that

$$\inf_{0 \le q \le 1 - V} T(q, V) = T\left(\frac{1 - V}{2}, V\right) = (1 - V) f\left(\frac{1 + V}{1 - V}\right)$$

which completes the proof. ■
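The closed form (43) can be compared against a direct grid minimization of $T(q, V)$. The sketch below is an illustrative check under assumptions: it uses the squared Hellinger function $f(x) = (\sqrt{x} - 1)^2/2$ as the symmetric example, for which (43) simplifies to $1 - \sqrt{1 - V^2}$, and an even grid size so that the mid-point $(1 - V)/2$ is a grid point. Names and parameters are hypothetical.

```python
import math

def hellinger_f(x):
    # Symmetric choice: f(x) = (sqrt(x) - 1)^2 / 2 satisfies f(x) = x f(1/x).
    return (math.sqrt(x) - 1.0) ** 2 / 2.0

def T(q, V, f):
    """T(q, V) = (1-q) f((1-V-q)/(1-q)) + q f((q+V)/q), for 0 < q <= 1 - V, V > 0."""
    num = max(0.0, 1 - V - q)               # guard against tiny negative rounding
    return (1 - q) * f(num / (1 - q)) + q * f((q + V) / q)

def B_grid(V, f, grid=2000):
    # Uniform grid on (0, 1-V]; with an even grid size the mid-point is included.
    return min(T((1 - V) * k / grid, V, f) for k in range(1, grid + 1))

V = 0.4
grid_value = B_grid(V, hellinger_f)
closed_form = (1 - V) * hellinger_f((1 + V) / (1 - V))
print(grid_value, closed_form, 1 - math.sqrt(1 - V ** 2))
```

Because $q \mapsto T(q, V)$ is convex and symmetric about the mid-point, the grid minimum is attained there and should match the closed form up to rounding.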



C. Chi-squared divergence 

In this section, we describe another situation where the conclusion of Theorem 2.1 can be further simplified.

Theorem 5.6: Let $m = 1$ and consider the quantity $A(D)$ where $D_f$ is the chi-squared divergence, $\chi^2(P\|Q)$, which corresponds to $f(x) := x^2 - 1$. Also let the function $f_1$ be such that the function $h : (0, \infty) \to (0, \infty)$ defined by $h(x) := (1 + f_1(x))/x$ is a strictly increasing, strictly convex, twice differentiable bijective mapping. Then $A(D) = h^{-1}(D + 1) - 1$, where $h^{-1}$ denotes the inverse function of $h$ on $(0, \infty)$.



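Proof: By Theorem 2.1, $A(D)$ equals the optimal value of the problem:

$$\begin{aligned}
\underset{p, q \in [0,1]^3}{\text{maximize}} \quad & \sum_{j : q_j > 0} \frac{p_j^2}{q_j} - 1 + \infty \cdot \sum_{j : q_j = 0} p_j \\
\text{subject to} \quad & p_j \ge 0,\ q_j \ge 0 \text{ for all } j = 1, 2, 3 \\
& \sum_{j} p_j = \sum_{j} q_j = 1 \\
& \sum_{j : q_j > 0} q_j f_1\left(\frac{p_j}{q_j}\right) + f_1'(\infty) \sum_{j : q_j = 0} p_j \le D.
\end{aligned}$$

By convexity of $h$, we have

$$h(x) \ge h(a) + h'(a)(x - a) \tag{45}$$

for every $x > 0$ and $a > 0$. One consequence of this and the fact that $h$ is strictly increasing is that

$$h(1) + h'(1)(x - 1) \le h(x) \le h(1)$$

for all $x \in (0, 1)$. This implies that $\lim_{x \to 0} x h(x) = 0$ and as a result

$$f_1(0) = \lim_{x \downarrow 0} f_1(x) = \lim_{x \downarrow 0} \left(x h(x) - 1\right) = -1.$$

Further, because $h$ is strictly increasing, we have $h'(a) > 0$ and thus

$$f_1'(\infty) = \lim_{x \to \infty} h(x) = \infty$$

which implies that we only need to consider $P$ and $Q$ for which $\sum_{j : q_j = 0} p_j = 0$. Writing (45) in terms of $f_1(x)$, we obtain

$$1 + f_1(x) \ge x\left(h(a) - a h'(a)\right) + x^2 h'(a)$$

for every $x > 0$ and also at $x = 0$ (because $f_1(0) := \lim_{x \to 0} f_1(x)$). Applying this inequality to $x = p_j/q_j$ for $q_j > 0$ and then multiplying by $q_j$, we obtain

$$q_j + q_j f_1\left(\frac{p_j}{q_j}\right) \ge p_j\left(h(a) - a h'(a)\right) + \frac{p_j^2}{q_j} h'(a)$$

for each $j = 1, 2, 3$. As a result, we get

$$h'(a) \sum_{j : q_j > 0} \frac{p_j^2}{q_j} + h(a) - a h'(a) \le 1 + \sum_{j : q_j > 0} q_j f_1\left(\frac{p_j}{q_j}\right).$$

Because $P$ and $Q$ satisfy the constraint, we have

$$\sum_{j : q_j > 0} q_j f_1\left(\frac{p_j}{q_j}\right) \le D$$

and hence

$$\sum_{j : q_j > 0} \frac{p_j^2}{q_j} - 1 \le \frac{D + 1 - h(a) + a h'(a)}{h'(a)} - 1.$$

Because $a > 0$ is arbitrary, we get

$$A(D) \le \inf_{a > 0} \frac{D + 1 - h(a) + a h'(a)}{h'(a)} - 1.$$

By elementary calculus, the above infimum is achieved at $a^* = h^{-1}(D + 1)$ and we then obtain $A(D) \le h^{-1}(D + 1) - 1$. To see that $A(D)$ is exactly equal to $h^{-1}(D + 1) - 1$, observe that the probabilities $P = (1, 0, 0)$ and $Q = (1/a^*, 1 - 1/a^*, 0)$ satisfy $D_{f_1}(P\|Q) = D$ and $\chi^2(P\|Q) = h^{-1}(D + 1) - 1$. ■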
The function $f_1(x) = x^l - 1$ for $l > 2$ clearly satisfies the conditions of the above theorem. We therefore obtain the following result as a simple corollary.

Corollary 5.7: Let $m = 1$ and consider the quantity $A(D)$ where $D_f(P\|Q) = \chi^2(P\|Q)$ and $D_{f_1}$ is the power divergence, $D^{(l)}(P\|Q)$, corresponding to $f_1(x) = x^l - 1$ for $l > 2$. Then $A(D) = (1 + D)^{1/(l-1)} - 1$. This yields the sharp inequality

$$\chi^2(P\|Q) + 1 \le \left(1 + D^{(l)}(P\|Q)\right)^{1/(l-1)}$$

between the chi-squared divergence and power divergence for $l > 2$.
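The inequality of Corollary 5.7 and its extremal pair can be checked numerically. The following sketch is an illustration under assumptions: it fixes $l = 3$ and $D = 2$, draws random three-point distributions with strictly positive entries using a seeded generator, and evaluates the extremal pair $P = (1, 0, 0)$, $Q = (1/a^*, 1 - 1/a^*, 0)$ from the proof of Theorem 5.6. All helper names and parameter values are hypothetical.

```python
import random

def chi_squared(p, q):
    # chi^2(P||Q) = sum p_j^2 / q_j - 1; assumes p_j = 0 wherever q_j = 0.
    return sum(a * a / b for a, b in zip(p, q) if b > 0) - 1.0

def power_divergence(p, q, l):
    # D^(l)(P||Q) for f_1(x) = x^l - 1; assumes p_j = 0 wherever q_j = 0.
    return sum(b * (a / b) ** l for a, b in zip(p, q) if b > 0) - 1.0

def random_distribution(rng, n):
    w = [rng.random() + 1e-3 for _ in range(n)]   # strictly positive weights
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
l = 3.0

# Smallest slack in (1 + D^(l))^(1/(l-1)) - (chi^2 + 1) over random pairs.
worst_gap = min(
    (1 + power_divergence(p, q, l)) ** (1 / (l - 1)) - (chi_squared(p, q) + 1)
    for p, q in ((random_distribution(rng, 3), random_distribution(rng, 3))
                 for _ in range(1000))
)

# Extremal pair from the proof of Theorem 5.6, with h(x) = x^(l-1).
D = 2.0
a_star = (D + 1) ** (1 / (l - 1))
P = [1.0, 0.0, 0.0]
Q = [1 / a_star, 1 - 1 / a_star, 0.0]
print(worst_gap, power_divergence(P, Q, l), chi_squared(P, Q), a_star - 1)
```

The slack stays nonnegative on the random pairs, and the extremal pair meets the constraint with equality while attaining $h^{-1}(D + 1) - 1$.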

VI. Numerical Computation 

In this section we explore numerical methods for solving the optimization problems (7) and (8) in order to compute $A(D_1, \ldots, D_m)$ and $B(D_1, \ldots, D_m)$ respectively. In Section VI-A we consider the special case when $D_f$ is a primitive divergence. This special case is motivated by the statistical problem of obtaining lower bounds for the minimax risk and we show that the quantity $A(D_1, \ldots, D_m)$ can be computed exactly via convex optimization for every $m \ge 1$ and every arbitrary choice of $D_{f_1}, \ldots, D_{f_m}$. In Section VI-B we consider the special case $m = 1$ and demonstrate that (7) and (8) can be solved for practically any pair of $f$-divergences by a gridded search over the low-dimensional parameter space. We verify several known inequalities and also improve on some existing inequalities that are not sharp.

A. Maximizing Primitive Divergences 

In this subsection we consider maximizing a primitive 
divergence subject to upper bounds on arbitrary /-divergences. 
While this optimization problem is not a-priori convex, we 
reduce it to a collection of convex problems. 

The optimization problem (7) where $D_f$ is a primitive divergence is, of course, closely related to the problem of bounding from above a primitive divergence subject to upper bounds on other $f$-divergences. This latter problem arises in obtaining lower bounds for the minimax risk in nonparametric
statistical estimation (see, for example, |j5]-|j8)). For example, 
Le Cam's inequality, which is a popular technique for obtaining minimax lower bounds, says that the minimax risk is bounded from below by a multiple of the $L_1$ affinity between two probability measures $P$ and $Q$, where the $L_1$ affinity is defined as $1 - V(P, Q)$. The $L_1$ affinity also appears in Assouad's Lemma, another technique for obtaining minimax lower bounds. Evaluating $V(P, Q)$ is hard because $P$ and $Q$ are typically product distributions of the form $P = \otimes_{i=1}^{n} P_i$ (or mixtures of such distributions), so it is difficult to express $V(P, Q)$ in terms of $V(P_i, Q_i)$ (which can be easier to compute).

Application of Le Cam's inequality in practice, therefore, requires one to obtain a good upper bound on the total variation, $V(P, Q)$. One typically first bounds $D_f(P\|Q)$ for an $f$-divergence that decouples for product distributions such as



squared Hellinger, chi-squared, or Kullback-Leibler divergence 
and then translates this into a bound on V{P, Q). It is common 
to use crude bounds like Pinsker's inequality for this purpose, 
and we believe there is room for improvement by using tight 
bounds. In particular, the constants might improve, addressing 
a common criticism of minimax lower bound techniques. 



Theorem 6.1 below solves the problem of maximizing a primitive divergence given constraints on $m$ other divergences $D_{f_i}$ exactly via convex optimization. This leads to a fast algorithm with well-studied convergence properties.

For each $m \ge 1$, let

$$S_m := \left\{ \sigma \in \{-1, 1\}^{m+2} : \sigma_i \le \sigma_j \text{ for } i < j \right\}.$$

For each $\sigma \in S_m$, let us consider the following convex optimization problem and denote its optimal value by $V_\sigma(D_1, \ldots, D_m)$.



$$\begin{aligned}
\underset{p, q \in [0,1]^{m+2}}{\text{maximize}} \quad & \sum_{j=1}^{m+2} \sigma_j (p_j - s q_j) \\
\text{subject to} \quad & p_j \ge 0,\ q_j \ge 0 \text{ for all } j = 1, \ldots, m + 2 \\
& \sum_{j=1}^{m+2} p_j = \sum_{j=1}^{m+2} q_j = 1 \\
& D_{f_i}(P\|Q) \le D_i
\end{aligned} \tag{46}$$

for $i = 1, \ldots, m$. Note that this problem is convex because the objective function is linear and the constraint set is convex in $p_1, \ldots, p_{m+2}, q_1, \ldots, q_{m+2}$. The fact that the constraint set is convex is a consequence of the convexity of $D_{f_i}(P\|Q)$ in $(P, Q)$ (see, for example, [17, Lemma 4.1]). It is also clear that this is a $(2m + 2)$-dimensional optimization problem because there are $2m + 4$ variables in all which satisfy two linear equality constraints.

Theorem 6.1: Let $D_f$ denote the primitive $f$-divergence corresponding to $f = u_s$ for some $s > 0$. Then

$$A(D_1, \ldots, D_m) = -\frac{|s - 1|}{2} + \frac{1}{2} \max_{\sigma \in S_m} V_\sigma(D_1, \ldots, D_m). \tag{47}$$

Consequently, $A(D_1, \ldots, D_m)$ can be computed by solving the $|S_m| = m + 3$ convex optimization problems (46).



Proof: Theorem 2.1 asserts that $A(D_1, \dots, D_m)$ equals the optimal value of the optimization problem (7). Note that the constraint sets of the problems (7) and (46) are the same. Let us denote this constraint set by $C_m$ so that

$$A(D_1, \dots, D_m) = \max_{P, Q \in C_m} D_{u_s}(P \| Q).$$

The objective of (7) can be written as

$$
\begin{aligned}
D_{u_s}(P \| Q) &= \min(1, s) - \sum_{j=1}^{m+2} \min(p_j, s q_j) \\
&= \min(1, s) - \frac{1}{2} \sum_{j=1}^{m+2} (p_j + s q_j) + \frac{1}{2} \sum_{j=1}^{m+2} |p_j - s q_j| \\
&= -\frac{|s - 1|}{2} + \frac{1}{2} \max_{\sigma \in \{-1, 1\}^{m+2}} \sum_{j=1}^{m+2} \sigma_j (p_j - s q_j).
\end{aligned}
$$

Because two maxima can always be interchanged, we have

$$\max_{P, Q \in C_m} \ \max_{\sigma \in \{-1, 1\}^{m+2}} \sum_{j=1}^{m+2} \sigma_j (p_j - s q_j) = \max_{\sigma \in \{-1, 1\}^{m+2}} \ \max_{P, Q \in C_m} \sum_{j=1}^{m+2} \sigma_j (p_j - s q_j).$$

Note that the inner maximization on the right hand side above is precisely the convex problem (46).

Because the optimal value of (46) is invariant to permuting the indices of $\sigma$, we have the reduction

$$\max_{\sigma \in \{-1, 1\}^{m+2}} \ \max_{P, Q \in C_m} \sigma^{\top}(P - sQ) = \max_{\sigma \in S_m} \ \max_{P, Q \in C_m} \sigma^{\top}(P - sQ).$$

This shows that we can restrict attention only to those problems (46) for $\sigma \in S_m$. It is obvious that $|S_m| = m + 3$. The proof is complete. ■
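The identity at the heart of the proof, $D_{u_s}(P \| Q) = \min(1, s) - \sum_j \min(p_j, s q_j) = -|s - 1|/2 + \tfrac{1}{2} \max_{\sigma} \sum_j \sigma_j (p_j - s q_j)$, is easy to sanity-check numerically. The following is a minimal sketch in plain Python; the function names are hypothetical and the random test distributions are our own choice, not something taken from the paper.

```python
import itertools
import random


def primitive_divergence(p, q, s):
    """D_{u_s}(P||Q) = min(1, s) - sum_j min(p_j, s * q_j)."""
    return min(1.0, s) - sum(min(a, s * b) for a, b in zip(p, q))


def sign_vector_form(p, q, s):
    """-|s - 1| / 2 + (1/2) * max over sigma of sum_j sigma_j * (p_j - s * q_j)."""
    best = max(
        sum(sg * (a - s * b) for sg, a, b in zip(sigma, p, q))
        for sigma in itertools.product((-1, 1), repeat=len(p))
    )
    return -abs(s - 1.0) / 2.0 + 0.5 * best


def random_distribution(k, rng):
    """Draw a random probability vector of length k."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
max_error = 0.0
for _ in range(1000):
    p = random_distribution(4, rng)
    q = random_distribution(4, rng)
    s = rng.uniform(0.1, 5.0)
    error = abs(primitive_divergence(p, q, s) - sign_vector_form(p, q, s))
    max_error = max(max_error, error)

# Largest discrepancy between the two expressions over all random trials.
print(max_error)
```

For $s = 1$ the first function reduces to the total variation distance $1 - \sum_j \min(p_j, q_j)$.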

Example 6.2: Consider the special case of Theorem 6.1 when $m = 1$, $s = 1$ and when $D_{f_1}$ is the squared Hellinger distance which corresponds to $f_1(x) = (\sqrt{x} - 1)^2/2$. In other words, we consider the problem of maximizing the total variation distance subject to an upper bound on the squared Hellinger distance. The solution to this problem given by Theorem 6.1 is plotted in Figure 1(a). Each red dot shows $A(H) =: A^V_H(H)$ computed by solving the four 4-dimensional convex optimization problems (46) (each corresponding to a $\sigma \in S_1$).

Note that the quantity $A^V_H(H)$ can be obtained analytically in a closed form. Indeed, since $D_{f_1}$ is a symmetric divergence, the sharp inequality bounding the total variation distance by the squared Hellinger distance is given by (44) with $f(x) = (\sqrt{x} - 1)^2/2$ (this inequality is usually attributed to [32]) which implies that

$$A^V_H(H) = \sqrt{H(2 - H)}.$$

We have plotted this analytic function as the solid cyan line in Figure 1(a). It is clear that our numerical optimization method given by Theorem 6.1 agrees with the known analytical bound.
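The classical bound between total variation $V$ and squared Hellinger distance $H$ (normalized so that $H = \tfrac{1}{2}\sum_j (\sqrt{p_j} - \sqrt{q_j})^2$) reads $V \le \sqrt{H(2 - H)}$. As a rough numerical check that is independent of the convex programs, the sketch below tests this bound on random three-point pairs and on one two-point pair that attains it. The function names, the random sampling, and the particular pair $P = (t, 1 - t)$, $Q = (1 - t, t)$ are our own assumptions, not the authors' procedure.

```python
import math
import random


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def squared_hellinger(p, q):
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def closed_form_bound(h):
    """Upper bound on total variation given squared Hellinger distance h."""
    return math.sqrt(h * (2.0 - h))


def random_distribution(k, rng):
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
gaps = []
for _ in range(2000):
    p = random_distribution(3, rng)
    q = random_distribution(3, rng)
    gaps.append(closed_form_bound(squared_hellinger(p, q)) - total_variation(p, q))
worst_gap = min(gaps)  # should never be negative

# A two-point pair for which the bound holds with equality.
t = 0.2
p_star, q_star = [t, 1.0 - t], [1.0 - t, t]
attained_gap = closed_form_bound(squared_hellinger(p_star, q_star)) - total_variation(p_star, q_star)

print(worst_gap, attained_gap)
```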

Example 6.3: For another simple application of Theorem 6.1, consider maximizing the total variation subject to an upper bound on the Kullback-Leibler divergence. In other words, we take $m = 1$, $s = 1$ and $f_1(x) = x \log x$, and plot the solution given by Theorem 6.1 in Figure 1(b). Each black dot shows $A(K) =: A^V_K(K)$ for a different value of $K$, computed by solving the four 4-dimensional convex optimization problems (46). The solid green line shows Pinsker's analytic upper bound $\sqrt{K/2}$, which is not sharp for any $K > 0$.
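For intuition about how Pinsker's bound $\sqrt{K/2}$ compares with attainable values, one can run a crude grid search over two-point pairs. This is only a sketch under stated assumptions: it is not the CVX computation described in the paper, the function names are hypothetical, and because Theorem 2.1 requires three-point measures for $m = 1$, a two-point search yields only a lower bound on the sharp value.

```python
import math


def kl_two_point(p, q):
    """KL divergence between (p, 1-p) and (q, 1-q), with 0 log 0 = 0; needs 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def two_point_lower_bound(k, n=200):
    """Largest total variation |p - q| over an n-step grid of two-point pairs with KL <= k."""
    best = 0.0
    for i in range(n + 1):
        for j in range(1, n):  # keep q strictly inside (0, 1) so KL stays finite
            p, q = i / n, j / n
            if kl_two_point(p, q) <= k:
                best = max(best, abs(p - q))
    return best


for k in (0.1, 1.0, 3.0):
    # Columns: KL budget, attainable total variation (grid), Pinsker's bound.
    print(k, two_point_lower_bound(k), math.sqrt(k / 2))
```

For $K > 2$ Pinsker's bound exceeds 1 and is therefore vacuous, since total variation never exceeds 1.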

Fig. 1: Two simple applications of Theorem 6.1, discussed in Examples 6.2 and 6.3. Here and in all subsequent plots we set the axis limits to the maximum value of the relevant $f$-divergence and to 5 in the case of the Kullback-Leibler divergence (which has no maximum value).

Example 6.4: We now consider maximizing the total variation subject to constraints on both the squared Hellinger distance and the Kullback-Leibler divergence. In other words, we take $m = 2$, $s = 1$, $f_1(x) = (\sqrt{x} - 1)^2/2$ and $f_2(x) = x \log x$. To the best of our knowledge, there does not exist a closed form analytical solution to this problem. However, numerical solution is straightforward by Theorem 6.1 as shown below.

According to Theorem 6.1, for fixed $H, K > 0$ we can compute $A(H, K) =: A^V_{H,K}(H, K)$ by solving five 6-dimensional convex programs (46). Figure 2 shows the function $A^V_{H,K}(H, K)$ interpolated from 14884 $(H, K)$ pairs. We used CVX in MATLAB to solve the convex programs. The height of each point in the surface shows how large the total variation can be when the squared Hellinger distance and Kullback-Leibler divergence are bounded by $H$ and $K$ respectively. As expected, the total variation is zero when either $H = 0$ or $K = 0$, and it approaches 1 for large values of $H$ and $K$. Next, observe that the surface $A^V_{H,K}(H, K)$ is flat as $K$ varies for small $H$, and vice versa flat as $H$ varies for small $K$. This is because only one constraint is tight in these regions. In other words, the surface $A^V_{H,K}(H, K)$ is approximately the pointwise minimum of the two surfaces $A^V_H(H)$ and $A^V_K(K)$, with a diagonal ridge at the intersection of these two surfaces. But, as can be seen in Figure 3, our bound that simultaneously leverages both single-coordinate bounds is strictly better than the simple minimum of those two individual bounds for some $(H, K)$. In other words, there exist $(H, K)$ such that

$$\min\left(A^V_H(H),\ A^V_K(K)\right) - A^V_{H,K}(H, K) > 0. \qquad (48)$$

The left hand side above is positive when both single-coordinate bounds are informative, i.e. when both constraints in the optimization problem (7) are active. We will explain later (see Example 6.5 and Figure 4) that the location of this ridge is predicted by an inequality between $D_H(P \| Q)$ and $D_{KL}(P \| Q)$.








Fig. 2: The height of each point in the surface above shows $A^V_{H,K}(H, K)$ for a different $(H, K)$ pair, that is, the least upper bound on total variation when squared Hellinger distance and Kullback-Leibler divergence are bounded by $H$ and $K$ respectively (see Example 6.4).

Fig. 3: Improvement over simple pointwise minimum of single-coordinate bounds. The color of the pixel at $(H, K)$ represents the magnitude of the left hand side of (48). The bright region corresponds to $(H, K)$ for which the bound displayed in Figure 2 is a strict improvement over the simple pointwise minimum of the two bounds shown in Figure 1.



B. The General Case

Theorem 6.1 requires $D_f$ to be a primitive divergence. We do not know if, in general, the optimization problems (7) and (8) can be solved by convex optimization algorithms. However, if $m$ is not too large, heuristic optimization techniques can be used. We demonstrate this in this subsection for $m = 1$.

Example 6.5: Consider the optimization problem (7) for $m = 1$, $f(x) = (\sqrt{x} - 1)^2/2$ and $f_1(x) = x \log x$. In other words, we consider the problem of maximizing the squared Hellinger distance subject to an upper bound on the Kullback-Leibler divergence. The optimization problem (7) is clearly 4-dimensional (there are six variables in all, $p_1, p_2, p_3$ and $q_1, q_2, q_3$, but they satisfy two linear constraints as they sum to one). Because the variable space is only 4-dimensional, there was no trouble solving this by gridding the parameter space. We plot the solution in Figure 4(a) where each blue dot shows $A(K) =: A^H_K(K)$ for a different value of $K$.

The quantity $A^H_K(K)$ can be used to better understand the inequality (48). Indeed, when we overlay the curve $(K, A^H_K(K))$ on Figure 3 (see Figure 4(b)), we see that the curve $(K, A^H_K(K))$ (plotted by the blue line) lies above the region where the inequality (48) holds. Only the constraint on $D_{KL}(P \| Q)$ is active in the optimization problem considered in Example 6.4 when $H > A^H_K(K)$. For such $(H, K)$, therefore, the inequality (48) does not hold.

Fig. 4: A sharp inequality between squared Hellinger distance and Kullback-Leibler divergence bounds the support of the ridge. The upper panel displays a sharp inequality between squared Hellinger and Kullback-Leibler divergence. The height of each blue dot represents the optimal value $A^H_K(K)$ with a different constraint, $K$, on the Kullback-Leibler divergence. The lower panel shows the same blue curve overlaid on Figure 3. Observe that the region with positive improvement is bounded by the red line from the upper panel.

Example 6.6: Consider maximizing the squared Hellinger distance between $P$ and $Q$ with the total variation between $P$ and $Q$, $V(P, Q)$, bounded by $V$. In other words, we consider the special case of the problem (7) for $m = 1$, $f(x) = (\sqrt{x} - 1)^2/2$ and $f_1(x) = |x - 1|/2$. This is a special case of the problem we considered in Section IV-B where we proved that $A_2(V) < A_3(V)$ for all $V \in (0, 1)$. Here we confirm this fact numerically.

We compute both the quantities $A_2(V)$ and $A_3(V)$ by a gridded search over pairs of probabilities satisfying the constraint in $\mathcal{P}_2$ and $\mathcal{P}_3$ respectively. These functions are plotted in Figure 5. Each red triangle in Figure 5 shows $A_3(V)$ for a different $V$. Each point in the dotted blue line shows $A_2(V)$ for a different $V$. It is evident that the inequality $A_2(V) < A_3(V)$ holds for all $V \in (0, 1)$. In other words, when we restrict the constraint set to probability measures in $\mathcal{P}_2$, the maximum Hellinger distance is strictly smaller for all $V \in (0, 1)$. Therefore, Theorem 2.1 is in general tight and cannot be improved.

Note also that the plot of $A_3(V)$ agrees with the form $A_3(V) = A(V) = V(f(0) + f'(\infty)) = V$ derived in Section IV-B. This gives rise to the sharp inequality $H^2(P, Q) \le V(P, Q)$ which is again attributed to [32].

Example 6.7: The capacitory discrimination between two probability measures $P$ and $Q$ is defined by

$$C(P, Q) := D_{KL}\left(P \,\Big\|\, \frac{P + Q}{2}\right) + D_{KL}\left(Q \,\Big\|\, \frac{P + Q}{2}\right).$$

It is easy to check that $C(P, Q)$ is an $f$-divergence that corresponds to the convex function

$$x \log x - (x + 1) \log(x + 1) + 2 \log 2. \qquad (49)$$

The triangular discrimination $\Delta(P, Q)$ is another $f$-divergence that corresponds to the convex function

$$\frac{(x - 1)^2}{x + 1}. \qquad (50)$$

Topsøe proved the following inequality between these two $f$-divergences [33]:

$$\frac{1}{2} \Delta(P, Q) \le C(P, Q) \le (\log 2)\, \Delta(P, Q). \qquad (51)$$

Let us investigate here the sharpness of these inequalities. Let

$$A(D_1) := \sup \{C(P, Q) : \Delta(P, Q) \le D_1\}$$

and

$$B(D_1) := \inf \{C(P, Q) : \Delta(P, Q) \ge D_1\}.$$

Fig. 5: Three point measures strictly improve on two point measures. Each red triangle shows $A_3(V)$ computed by a gridded search over pairs of probability measures in $\mathcal{P}_3$. Each blue dot shows $A_2(V)$ computed by a gridded search over pairs of probability measures in $\mathcal{P}_2$. The simulation over three point measures is exactly a straight line with slope one, agreeing with Le Cam's bound $H^2 \le V$. And $A_2(V) < A_3(V)$ for all $V \in (0, 1)$.
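The comparison shown in Fig. 5 can be mimicked on a small scale. The sketch below is not the authors' code: it grid-searches two-point pairs for the largest squared Hellinger distance with total variation at most $V$, and compares the result with one explicit three-point pair, $P = (V, 1 - V, 0)$ and $Q = (0, 1 - V, V)$, which is our own choice of witness with total variation $V$ and squared Hellinger distance $V$. Function names and the grid resolution are assumptions.

```python
import math


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def squared_hellinger(p, q):
    # (1/2) * sum (sqrt(p_j) - sqrt(q_j))^2 equals 1 - sum sqrt(p_j * q_j).
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))


def two_point_max(v, n=200):
    """Grid search for the largest squared Hellinger distance over two-point pairs with TV <= v."""
    best = 0.0
    for i in range(n + 1):
        for j in range(n + 1):
            p, q = i / n, j / n
            if abs(p - q) <= v + 1e-12:  # TV of (p, 1-p) and (q, 1-q) is |p - q|
                best = max(best, squared_hellinger((p, 1 - p), (q, 1 - q)))
    return best


def three_point_witness(v):
    """A three-point pair with total variation v and squared Hellinger distance v."""
    return (v, 1.0 - v, 0.0), (0.0, 1.0 - v, v)


for v in (0.25, 0.5, 0.75):
    p3, q3 = three_point_witness(v)
    # Columns: V, two-point grid maximum, three-point witness value.
    print(v, two_point_max(v), squared_hellinger(p3, q3))
```

On this grid the two-point maximum stays visibly below the three-point value $V$, in line with the strict inequality $A_2(V) < A_3(V)$.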



We solved the optimization problems (7) and (8) for $m = 1$, $f(x)$ given by (49) and $f_1(x)$ given by (50) by a gridded search. The resulting solutions for $A(D_1)$ and $B(D_1)$ are plotted in Figure 6 with red triangles corresponding to $A(D_1)$ and blue dots corresponding to $B(D_1)$. We have also plotted the bounds given by (51) in Figure 6 with the green line corresponding to $(\log 2) D_1$ and the blue line to $D_1/2$. It is clear from the figure that the upper bound in (51) is sharp while the lower bound is not sharp. The sharp lower bound is given by $B(D_1)$. We are unaware of an analytic formula for $B(D_1)$, but we conjecture that $B_2(D_1) = B_3(D_1)$ because this equality holds numerically. It may be possible to use this fact to find an analytic formula for $B(D_1)$.
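Topsøe's inequality $\tfrac{1}{2}\Delta(P, Q) \le C(P, Q) \le (\log 2)\,\Delta(P, Q)$ can be spot-checked directly from the definitions of the two divergences. The following minimal sketch uses hypothetical function names and randomly drawn three-point distributions of our own choosing; it also evaluates a pair with disjoint supports, where the upper bound holds with equality.

```python
import math
import random


def capacitory_discrimination(p, q):
    """C(P, Q) = KL(P || M) + KL(Q || M) with M = (P + Q) / 2 and 0 log 0 = 0."""
    total = 0.0
    for a, b in zip(p, q):
        mid = (a + b) / 2.0
        if a > 0:
            total += a * math.log(a / mid)
        if b > 0:
            total += b * math.log(b / mid)
    return total


def triangular_discrimination(p, q):
    """Delta(P, Q) = sum_j (p_j - q_j)^2 / (p_j + q_j), skipping empty atoms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def random_distribution(k, rng):
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(1)
pairs = [(random_distribution(3, rng), random_distribution(3, rng)) for _ in range(2000)]
lower_ok = all(
    0.5 * triangular_discrimination(p, q) <= capacitory_discrimination(p, q) + 1e-12
    for p, q in pairs
)
upper_ok = all(
    capacitory_discrimination(p, q) <= math.log(2) * triangular_discrimination(p, q) + 1e-12
    for p, q in pairs
)

# Disjoint supports: Delta = 2 and C = 2 log 2, so the upper bound is attained.
p_disjoint, q_disjoint = [1.0, 0.0], [0.0, 1.0]
print(lower_ok, upper_ok,
      capacitory_discrimination(p_disjoint, q_disjoint),
      math.log(2) * triangular_discrimination(p_disjoint, q_disjoint))
```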
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